jmlr jmlr2005 jmlr2005-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Georgios Sigletos, Georgios Paliouras, Constantine D. Spyropoulos, Michalis Hatzopoulos
Abstract: This article investigates the effectiveness of voting and stacked generalization -also known as stacking- in the context of information extraction (IE). A new stacking framework is proposed that accommodates well-known approaches for IE. The key idea is to perform cross-validation on the base-level data set, which consists of text documents annotated with relevant information, in order to create a meta-level data set that consists of feature vectors. A classifier is then trained using the new vectors. Therefore, base-level IE systems are combined with a common classifier at the metalevel. Various voting schemes are presented for comparing against stacking in various IE domains. Well known IE systems are employed at the base-level, together with a variety of classifiers at the meta-level. Results show that both voting and stacking work better when relying on probabilistic estimates by the base-level systems. Voting proved to be effective in most domains in the experiments. Stacking, on the other hand, proved to be consistently effective over all domains, doing comparably or better than voting and always better than the best base-level systems. Particular emphasis is also given to explaining the results obtained by voting and stacking at the meta-level, with respect to the varying degree of similarity in the output of the base-level systems. Keywords: stacking, voting, information extraction, cross-validation 1
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Informatics and Telecommunications, University of Athens, Panepistimiopolis, 157 71, Athens, Greece. Editor: William Cohen. Abstract: This article investigates the effectiveness of voting and stacked generalization -also known as stacking- in the context of information extraction (IE). [sent-10, score-0.312]
2 A new stacking framework is proposed that accommodates well-known approaches for IE. [sent-11, score-0.363]
3 Various voting schemes are presented for comparing against stacking in various IE domains. [sent-15, score-0.536]
4 Results show that both voting and stacking work better when relying on probabilistic estimates by the base-level systems. [sent-17, score-0.536]
5 Particular emphasis is also given to explaining the results obtained by voting and stacking at the meta-level, with respect to the varying degree of similarity in the output of the base-level systems. [sent-20, score-0.556]
6 Stacked generalization or stacking (Wolpert, 1992) is a common scheme that deals with the task of learning a meta-level classifier to combine the predictions of multiple base-level classifiers. [sent-25, score-0.503]
7 The success of stacking arises from its ability to exploit the diversity in the predictions of base-level classifiers and thus to predict with higher accuracy at the meta-level. [sent-26, score-0.466]
8 In contrast, no meta-level learning takes place when voting on the predictions of multiple classifiers. [sent-28, score-0.226]
9 Voting is typically used as a baseline against which the performance of stacking is compared. [sent-29, score-0.363]
10 Research on voting and stacking has primarily focused on classification. [sent-30, score-0.536]
11 Given an instance described by a feature vector < x 1 , ..., x n >, the predictions of the base-level classifiers form a new feature vector, which is assigned the class value y either by the meta-level classifier or by voting. [sent-41, score-0.21]
12 In this article we investigate the effectiveness of voting and stacking on the task of Information Extraction (IE). [sent-43, score-0.574]
13 IE is a form of shallow text processing that involves the population of a predefined template with relevant fragments extracted from a text document. [sent-44, score-0.537]
14 The key idea behind combining a set of IE systems through stacking is to learn a common meta-level classifier, such as a decision tree or a naive Bayes classifier, based on the output of the IE systems, towards higher extraction performance. [sent-49, score-0.489]
15 In order to apply voting and stacking to IE, the base-level classifiers should normally be replaced by systems that model IE as a classification task. [sent-51, score-0.67]
16 A typical IE system is trained using a set of sample documents, paired with templates that are filled with relevant text fragments from the documents. [sent-54, score-0.341]
17 This way of modelling the IE task, however, results in an enormous increase in the number of candidate text fragments, where only the few annotated fragments are considered positive examples, while all other fragments are considered negative examples during training. [sent-56, score-0.359]
18 Table 1 shows the examples that are constructed from a hypothetical text fragment within a page describing laptop products. [sent-57, score-0.531]
19 An indicative example of recognizing an instance of the field ram (highlighted in bold) from a page that describes a laptop product and the set of examples it generates. [sent-59, score-0.491]
20 Although the number of candidate text fragments can be somewhat reduced by using various heuristics (Freitag, 2000), modelling IE in this manner does not seem natural. [sent-60, score-0.214]
21 The task here is to recognize starting and ending token boundaries of relevant fragments within a document and then extract the enclosed content. [sent-65, score-0.216]
22 Table 2: (a) Part of a Web page describing laptop products; (b) the hand-filled template for this page. [sent-67, score-0.346]
23 Recent effort on adaptive IE (Ciravegna, 2001; Ciravegna and Lavelli, 2003) motivates the development of IE systems that can handle different text types, from rigidly structured to almost free text -where common wrappers fail- including mixed types. [sent-91, score-0.212]
24 The tags < f > and < /f > are used to annotate, in the training documents, each relevant field f ∈ { f 1 , ..., f Q }. [sent-102, score-0.261]
25 Despite its simplicity, stacking offers the advantage of being highly extensible to more algorithms at both base-level and meta-level, regardless of their internal structure. [sent-119, score-0.363]
26 Specifically, for each relevant field f ∈ { f 1 , ..., f Q }, the predictions of the available IE systems are collected. [sent-122, score-0.239]
27 In case of more than one prediction of the field f for a fragment t(s, e), a combined probability is estimated for < t(s, e), f > using Equation (1). [sent-126, score-0.526]
28 P C = 1 − ∏ j (1 − p j ), (1) where P C is the combined probabilistic estimate that the text fragment t(s, e) belongs to the field f, and p j the probabilistic estimate that some IE system E j has predicted the field f for t(s, e). [sent-127, score-0.834]
29 Actually, P C measures the probability that at least one of those IE systems that have predicted the field f for t(s, e) has predicted correctly, which equals the probability that not all predictions for f are wrong. [sent-128, score-0.395]
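As a concrete illustration of Equation (1), the following minimal Python sketch combines the per-system estimates for a fragment-field pair; the function name and data layout are illustrative assumptions, not taken from the paper.

```python
def combined_probability(estimates):
    """Equation (1): the probability that at least one of the systems that
    predicted field f for fragment t(s, e) has predicted correctly, i.e. that
    not all of those predictions are wrong."""
    wrong = 1.0
    for p in estimates:
        wrong *= (1.0 - p)
    return 1.0 - wrong

# Two systems predict the same field with estimates 0.7 and 0.6:
print(combined_probability([0.7, 0.6]))  # 1 - (0.3 * 0.4) ≈ 0.88
```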
30 For example, a page in the domain of computer science (CS) courses should describe only one course, and thus should contain only one instance of the field course title. [sent-131, score-0.24]
31 Assume in Table 3 that one hypothetical IE system predicts "256 MB" and "1 GB" as ram instances, while another system predicts only "256 MB" as ram. [sent-134, score-0.242]
32 Suppose that the correct instance is <"256 MB", ram>. The OPD field constraint, though useful in certain cases, is restrictive for IE in general and does not hold for all relevant fields. [sent-143, score-0.382]
33 For example, a Web page may describe more than one laptop product, and thus more than one ram instance may exist. [sent-144, score-0.284]
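Returning to the "256 MB" example above, the following is a hedged sketch of how the OPD constraint could be applied by keeping, per field, the instance with the highest combined probability from Equation (1); the probability values and data layout are hypothetical.

```python
def combined_probability(estimates):
    # Equation (1): 1 - prod_j (1 - p_j)
    wrong = 1.0
    for p in estimates:
        wrong *= (1.0 - p)
    return 1.0 - wrong

def apply_opd(instances):
    """instances: dict mapping (fragment_text, field) -> list of per-system
    probabilistic estimates.  For each field, keep only the fragment with the
    highest combined probability (the one-per-document constraint)."""
    best = {}
    for (fragment, field), estimates in instances.items():
        score = combined_probability(estimates)
        if field not in best or score > best[field][1]:
            best[field] = (fragment, score)
    return {field: fragment for field, (fragment, _) in best.items()}

# Hypothetical figures for the Table 3 example: "256 MB" is predicted as ram by
# two systems (combined estimate 0.88), "1 GB" by one system (0.8), so the OPD
# constraint keeps "256 MB".
print(apply_opd({("256 MB", "ram"): [0.7, 0.6], ("1 GB", "ram"): [0.8]}))
```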
34 The motivation behind mapping confidence scores to probabilistic estimates is that confidence scores are not always reliable, since incorrect matches may be assigned high scores. [sent-149, score-0.22]
35 Thus voting on different IE systems using confidence scores may not always be reliable. [sent-150, score-0.289]
36 The concept of the merged template is introduced, which is important for combining different IE systems either through voting or stacking. [sent-158, score-0.506]
37 Various voting schemes are then presented, against which the performance of stacking for IE will be compared. [sent-161, score-0.363]
38 Let L 1 , ..., L N be a set of N learning algorithms, designed for IE, which are given a corpus D of training documents, annotated with relevant field instances. [sent-166, score-0.268]
39 We suggest in this article that a merged template can be constructed from T 1 , ..., T N as follows. [sent-183, score-0.346]
40 All text fragments t(s, e) identified by E 1 , ..., E N are collected into the merged template. [sent-186, score-0.317]
41 Duplicate fragments are removed: two fragments differ if either their start or end boundary differs. [sent-193, score-0.232]
42 The fields predicted by E 1 , ..., E N are collected and inserted, together with the correct field, in the template. [sent-197, score-0.245]
43 If some IE system does not predict a field for a text fragment, then the corresponding cell in the merged template is empty. [sent-198, score-0.614]
44 If a text fragment does not exist in the hand-filled template, then the corresponding cell in the last column is also empty. [sent-199, score-0.343]
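The merged-template construction just described can be sketched as follows; the data layout (each system's template as a mapping from a (start, end) fragment to its predicted field) is an assumption made for illustration only.

```python
def build_merged_template(templates, hand_filled=None):
    """templates: one dict per IE system, mapping a fragment (start, end) to
    the field the system predicted for it.  The merged template has one row
    per distinct fragment identified by at least one system (two fragments
    differ if either boundary differs); a cell is None when a system made no
    prediction, and the last column holds the correct field from the
    hand-filled template, or None if the fragment is not in it."""
    fragments = set()
    for t in templates:
        fragments.update(t.keys())
    merged = {}
    for frag in sorted(fragments):
        row = [t.get(frag) for t in templates]
        correct = hand_filled.get(frag) if hand_filled else None
        merged[frag] = (row, correct)
    return merged

# Toy example in the spirit of Table 4 (two systems E1, E2):
t1 = {(10, 12): "model", (40, 41): "ram"}
t2 = {(10, 12): "manuf", (40, 41): "HDcapacity", (55, 57): "HDcapacity"}
hand = {(10, 12): "model"}
for frag, (row, correct) in build_merged_template([t1, t2], hand).items():
    print(frag, row, correct)
```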
45 Table 4 shows an illustrative example of a merged template that has been constructed from the output templates T 1 , T 2 of two IE systems E 1 , E 2 , for the page of Table 2(a). [sent-200, score-0.386]
46 Each entry corresponds to a text fragment that has been identified by at least one system. [sent-204, score-0.494]
47 For two text fragments ("TransPort ZX", "1GB") the fields predicted by E 1 and E 2 differ. [sent-206, score-0.366]
48 Comparing with the hand-filled template of Table 2(b), we conclude that "TransPort ZX" has been correctly identified as model only by E 1 , while E 2 identified the same fragment as manuf (the manufacturer of the laptop). [sent-207, score-0.729]
49 On the other hand, the fragment "1GB" does not exist in the hand-filled template. [sent-208, score-0.266]
50 Therefore, the fields predicted by the two systems for this fragment are false. [sent-209, score-0.464]
51 Furthermore, some text fragments have been identified by only one of the two IE systems. [sent-210, score-0.317]
52 The fragment “15''” has been identified only by E 1 , while the fragment “40 GB” has been identified only by E 2 . [sent-211, score-0.78]
53 The desirable result is to automatically fill the last column in the merged template of Table 4 with the correct fields. [sent-214, score-0.348]
54 In other words, we would like to assign the correct field to each text fragment that has been identified by at least one base-level system. [sent-215, score-0.694]
55 Majority Voting: A simple idea for combining the predictions of different IE systems is to use majority voting: for each entry in the merged template, we count the fields predicted by the available systems and select the field with the highest count. [sent-217, score-0.652]
56 Note that Table 4 contains missing values, reflecting the natural fact that some system may not have predicted a field for a text fragment that has been identified by another system. [sent-219, score-0.795]
57 For example, if some system predicts an incorrect field f for a text fragment t ( s, e) , while the remaining systems do not predict any field at all, then ignoring missing values during voting harms precision, since the incorrect field is returned. [sent-221, score-1.272]
58 An alternative is to record a missing value as “false”, providing evidence that no field should be predicted for t ( s, e ) . [sent-222, score-0.306]
59 If the value with the highest count is “false” then no field is assigned to t ( s, e) . [sent-223, score-0.207]
60 If, however, f is the correct field for t ( s, e) , interpreting the missing predictions by the remaining systems as “false” values harms overall extraction performance, since the correct field is rejected. [sent-224, score-0.658]
61 Therefore, two different settings of majority voting are defined, depending on whether missing values are ignored or encoded as “false” values that indicate rejection of prediction. [sent-225, score-0.265]
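A minimal sketch of the two majority-voting settings just described, assuming one merged-template row per fragment with None marking a missing prediction; names and data layout are illustrative.

```python
from collections import Counter

def majority_vote(row, missing_as_false=False):
    """row: the per-system predictions for one merged-template entry, with
    None marking a missing prediction.  In the first setting missing values
    are ignored; in the second they are encoded as "false", i.e. as votes
    against predicting any field.  Returns the winning field, or None when no
    field should be assigned to the fragment."""
    if missing_as_false:
        votes = ["false" if f is None else f for f in row]
    else:
        votes = [f for f in row if f is not None]
    if not votes:
        return None
    winner, _ = Counter(votes).most_common(1)[0]
    return None if winner == "false" else winner

# One system predicts ram for a fragment, two systems predict nothing:
print(majority_vote(["ram", None, None]))                         # 'ram'
print(majority_vote(["ram", None, None], missing_as_false=True))  # None
```

The example makes the precision/recall trade-off concrete: ignoring missing values returns the lone prediction, while encoding them as "false" rejects it.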
62 Voting Using Probabilities: The voting with probabilities scheme that is presented in this section shares many features with multistrategy learning, as described in (Freitag, 2000) and briefly outlined in Section 2. [sent-227, score-0.271]
63 Multistrategy learning considers each field in isolation during combination and relies on the OPD constraint for improving the extraction accuracy, as demonstrated by the example of Table 3. [sent-232, score-0.269]
64 On the other hand, voting using probabilities takes place on a merged template, like the one in Table 4, while no OPD assumption is required for any relevant field. [sent-233, score-0.328]
65 This allows the case of contradictory field predictions among different systems during combination, as demonstrated in Table 4, where the fragment “1GB” has been identified as ram by the first system and as HDcapacity by the second one. [sent-234, score-0.82]
66 Recall that two different settings of majority voting were defined, depending on whether the absence of a prediction by some system for a text fragment, i.e. a missing value, is ignored or encoded as "false". [sent-238, score-0.291]
67 Thus, two different settings for voting using probabilities are defined as follows: In the first setting, missing values are ignored, similar to the first setting of majority voting. [sent-243, score-0.261]
68 Given a fragment t ( s, e) , the field f with the highest probabilistic estimate by those systems that have predicted a field for t ( s, e) is returned. [sent-244, score-0.76]
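The first setting of voting with probabilities might be sketched as below; the per-field combination via Equation (1) when several systems agree is how the text above reads, and the data layout is an illustrative assumption.

```python
def vote_with_probabilities(predictions):
    """predictions: one (field, probability) pair per system that predicted a
    field for the fragment t(s, e); systems with no prediction are simply left
    out, i.e. missing values are ignored as in the first voting setting.  When
    several systems predict the same field, their estimates are combined with
    Equation (1); the field with the highest resulting estimate is returned."""
    per_field = {}
    for field, p in predictions:
        per_field.setdefault(field, []).append(p)
    best_field, best_score = None, 0.0
    for field, estimates in per_field.items():
        wrong = 1.0
        for p in estimates:
            wrong *= (1.0 - p)
        score = 1.0 - wrong
        if score > best_score:
            best_field, best_score = field, score
    return best_field

# "1 GB" predicted as ram (0.55) by one system and as HDcapacity (0.7) by another:
print(vote_with_probabilities([("ram", 0.55), ("HDcapacity", 0.7)]))  # 'HDcapacity'
```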
69 Then a new stacking framework for IE is presented, along with an extension that relies on using probabilistic estimates on the output of the base-level systems. [sent-253, score-0.383]
70 Motivation for Performing Learning: Examining the merged template of Table 4, we wonder whether we can learn to predict the correct field, based on the fields predicted by the available systems, rather than simply voting. [sent-255, score-0.501]
71 For example, if a system correctly predicts ram for the hypothetical fragment “1,5 GB”, while the other systems erroneously predict HDcapacity, then voting chooses the latter value. [sent-257, score-0.66]
72 Therefore, it would be desirable to perform learning in order to induce a rule of the form: if the first IE system predicts “ram” and the other systems predict “HDcapacity”, then the correct field is “ram”. [sent-258, score-0.298]
73 The idea suggested in this article is to create a feature vector for every row entry of the merged template, i.e. [sent-260, score-0.208]
74 for each text fragment that has been identified by at least one base-level system. [sent-262, score-0.467]
75 Table 5 shows the new feature vectors created by the merged template of Table 4. [sent-263, score-0.328]
76 (Flattened residue of Table 5: per-system predicted fields and the Class column, with values such as model, screenSize, screenType, procName, procSpeed, ram, HDcapacity and false.) Table 5. [sent-268, score-0.281]
77 Feature vectors created by the merged template of Table 4. [sent-269, score-0.308]
78 If a text fragment does not exist in the hand-filled template, the class attribute of the corresponding vector takes the value "false", which indicates rejection of prediction. [sent-272, score-0.372]
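A hedged sketch of how merged-template rows could be turned into labelled meta-level vectors, reusing the merged-template layout from the earlier sketch; the "missing" placeholder for absent predictions is an illustrative choice, not the paper's exact encoding.

```python
def merged_template_to_vectors(merged):
    """merged: the merged template from the earlier sketch, mapping each
    fragment (start, end) to (per-system predicted fields, correct field or
    None).  Every row becomes one meta-level feature vector; a missing
    prediction is kept as the placeholder "missing", and fragments absent
    from the hand-filled template get the class "false"."""
    vectors = []
    for fragment, (row, correct) in merged.items():
        features = ["missing" if f is None else f for f in row]
        label = correct if correct is not None else "false"
        vectors.append((features, label))
    return vectors

# Two rows in the spirit of Table 5: "TransPort ZX" (model / manuf, correct:
# model) and "1 GB" (ram / HDcapacity, absent from the hand-filled template).
merged = {(10, 12): (["model", "manuf"], "model"),
          (55, 57): (["ram", "HDcapacity"], None)}
for features, label in merged_template_to_vectors(merged):
    print(features, "->", label)
```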
79 Having specified the format of the feature vector, the remaining issue is to construct the full set of vectors for training the meta-level classifier using cross-validation, as described for the stacking framework in Section 2. [sent-274, score-0.47]
80 This disparity between base-level and meta-level data sets is handled by sampling from documents during cross-validation, rather than from feature vectors as in stacking for classification. [sent-277, score-0.419]
81 Stacking Using Nominal Values: The key idea behind stacking for IE is to learn a meta-level classifier based on the output of base-level systems via cross-validation as follows: at the jth fold, j = 1, ..., J. [sent-279, score-0.514]
82 Figure 2 presents an algorithmic description of the new stacking framework for IE. [sent-306, score-0.363]
83 A vector classified as “false” indicates that the corresponding fragment t ( s, e) does not exist in the hand-filled template, and thus the available base-level systems should not have predicted a field for it (for example the fragment “1 GB” in Table 5). [sent-308, score-0.858]
84 (Fragment of the Figure 2 pseudocode: class ∈ { f 1 , ..., f Q , false } is set to the correct field for t(s, e), and MD j = MD j ∪ vector < f 1 , ..., f N , class >.) [sent-327, score-0.227]
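A hedged sketch of the cross-validation loop behind Figure 2, assuming each base-level learner is a callable that returns a trained system (itself a callable from document text to a filled template) and reusing the two helper sketches above; all names are illustrative.

```python
def build_meta_level_data(documents, learners, J,
                          build_merged_template, merged_template_to_vectors):
    """documents: list of (text, hand_filled_template) pairs.  At each of the
    J folds, every base-level learner is trained on the other J - 1 folds and
    run on the held-out documents; the rows of the resulting merged templates
    become labelled meta-level vectors, accumulated over all folds."""
    folds = [documents[i::J] for i in range(J)]
    meta_data = []
    for j in range(J):
        train = [d for i, fold in enumerate(folds) if i != j for d in fold]
        systems = [learner(train) for learner in learners]     # trained IE systems
        for text, hand_filled in folds[j]:                     # held-out documents
            templates = [system(text) for system in systems]   # filled templates
            merged = build_merged_template(templates, hand_filled)
            meta_data.extend(merged_template_to_vectors(merged))
    return meta_data
```

Note that, unlike common stacking, the loop samples documents rather than feature vectors, which is exactly the difference stressed in the surrounding text.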
85 The key difference between stacking for IE and common stacking is that cross-validation operates on text documents paired with hand-filled templates, instead of feature vectors labelled with class values. [sent-335, score-0.859]
86 This removes the constraint of performing common classification at base-level, thus allowing the application of stacking to IE. [sent-336, score-0.422]
87 The size of the meta-level data set is not known a priori in stacking for IE, unlike common stacking where there is a one-to-one correspondence between base-level and meta-level vectors. [sent-344, score-0.726]
88 In IE, however, a text fragment that is relevant to our task may not have been identified by any of the available base-level systems. [sent-352, score-0.499]
89 In that case, there is no possibility of identifying that fragment at the meta-level, since no feature vector is created, thus resulting in a loss of information. [sent-353, score-0.286]
90 This observation suggests that IE systems biased towards recall (percentage of the annotated field instances that were identified) should generally be preferred for combination, expecting to reduce the loss of information at the meta-level. [sent-354, score-0.261]
91 Moreover, a meta-level vector in common stacking for classification does not contain missing values, since each base-level classifier predicts a nominal class value or a probability distribution over the relevant classes. [sent-355, score-0.609]
92 On the other hand, a meta-level vector in stacking for IE may contain missing values, as shown in Table 5, since some system may not have predicted a field for a fragment that has been identified by another system. [sent-356, score-1.081]
93 Another issue in stacking for IE is the choice of J in the J -fold cross-validation process depicted in Figure 2. [sent-366, score-0.363]
94 Given a new document d, the trained systems E 1 , ..., E N are used to identify relevant field instances and fill the corresponding templates T 1 , ..., T N . [sent-394, score-0.302]
95 If a vector is classified as one of the fields f 1 , ..., f Q , then the corresponding instance < t(s, e), f > is inserted in the final template for d. [sent-416, score-0.24]
96 At runtime, the stacking framework for IE is graphically depicted in Figure 3. [sent-419, score-0.363]
97 (Figure 3: runtime architecture of stacking for IE. A new document d is processed by the base-level systems E 1 , E 2 , ..., E N , producing templates T 1 , T 2 , ..., T N ; these are combined into a merged template, feature vectors are created from it, and the meta-level classifier C M fills the final template T.) [sent-420, score-0.405]
98 In contrast, both input and output in the runtime stacking for classification architecture of Figure 1(b) consist of a single feature vector. [sent-423, score-0.495]
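A hedged sketch of the runtime flow depicted in Figure 3, reusing the merged-template helper from the earlier sketch and treating the trained base-level systems and the meta-level classifier as plain callables; names are illustrative.

```python
def extract_at_runtime(text, systems, meta_classifier, build_merged_template):
    """Runtime flow of Figure 3: every trained base-level system fills its own
    template for the new document, the templates are merged, each row becomes
    a feature vector, and the meta-level classifier assigns the final field.
    Rows classified as "false" are dropped from the final template."""
    templates = [system(text) for system in systems]
    merged = build_merged_template(templates)          # no hand-filled template now
    final_template = {}
    for fragment, (row, _) in merged.items():
        features = ["missing" if f is None else f for f in row]
        field = meta_classifier(features)              # a field name or "false"
        if field != "false":
            final_template[fragment] = field
    return final_template
```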
99 Stacking Using Probabilities: A straightforward extension of stacking with nominal values is to rely on the confidence scores of the base-level systems, converted into probabilistic estimates of correctness. [sent-425, score-0.479]
100 The new framework is described as follows: • Instead of predicting one of the Q relevant fields for each fragment t ( s, e) , each system generates a confidence score c k for the predicted field f k . [sent-426, score-0.766]
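One illustrative way to lay out the probabilistic meta-level vector described above; the exact encoding is not given in this excerpt, so the pairing of each system's predicted field with its probability, and the treatment of missing predictions, are assumptions.

```python
def probabilistic_meta_vector(predictions):
    """predictions: one (field, probability) pair per base-level system, or
    None when the system made no prediction.  Each system contributes two
    features: the predicted field and the probabilistic estimate derived from
    its confidence score; a missing prediction becomes ("missing", 0.0).  This
    encoding is an illustrative assumption, not the paper's exact format."""
    features = []
    for pred in predictions:
        field, p = pred if pred is not None else ("missing", 0.0)
        features.extend([field, p])
    return features

print(probabilistic_meta_vector([("ram", 0.9), None, ("HDcapacity", 0.4)]))
# ['ram', 0.9, 'missing', 0.0, 'HDcapacity', 0.4]
```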
wordName wordTfidf (topN-words)
[('ie', 0.628), ('stacking', 0.363), ('fragment', 0.266), ('field', 0.207), ('template', 0.185), ('voting', 0.173), ('laptop', 0.128), ('identified', 0.124), ('merged', 0.123), ('ram', 0.123), ('fields', 0.118), ('fragments', 0.116), ('freitag', 0.116), ('hdcapacity', 0.098), ('multistrategy', 0.098), ('classifier', 0.087), ('gb', 0.083), ('opd', 0.079), ('false', 0.079), ('text', 0.077), ('procname', 0.069), ('procspeed', 0.069), ('screentype', 0.069), ('sigletos', 0.069), ('confidence', 0.066), ('mb', 0.063), ('extraction', 0.062), ('ciravegna', 0.059), ('hatzopoulos', 0.059), ('paliouras', 0.059), ('spyropoulos', 0.059), ('transport', 0.059), ('classification', 0.059), ('predicted', 0.055), ('intel', 0.053), ('predictions', 0.053), ('pentium', 0.051), ('md', 0.051), ('classifiers', 0.05), ('screensize', 0.049), ('tft', 0.049), ('missing', 0.044), ('templates', 0.043), ('zx', 0.04), ('classified', 0.039), ('georgios', 0.039), ('kushmerick', 0.039), ('lm', 0.039), ('stacked', 0.039), ('article', 0.038), ('final', 0.037), ('documents', 0.036), ('document', 0.035), ('runtime', 0.033), ('token', 0.033), ('processor', 0.033), ('mhz', 0.033), ('wrappers', 0.033), ('page', 0.033), ('relevant', 0.032), ('athens', 0.03), ('bwi', 0.03), ('filled', 0.03), ('manuf', 0.03), ('predefined', 0.03), ('rejection', 0.029), ('annotated', 0.029), ('hypothetical', 0.027), ('entry', 0.027), ('table', 0.026), ('web', 0.025), ('scores', 0.025), ('systems', 0.025), ('defined', 0.025), ('populated', 0.025), ('predicts', 0.024), ('system', 0.022), ('tags', 0.022), ('trained', 0.021), ('modelling', 0.021), ('correct', 0.02), ('feature', 0.02), ('output', 0.02), ('califf', 0.02), ('constantine', 0.02), ('fill', 0.02), ('foreach', 0.02), ('greece', 0.02), ('harms', 0.02), ('intensified', 0.02), ('michalis', 0.02), ('muslea', 0.02), ('screen', 0.02), ('sdram', 0.02), ('shallow', 0.02), ('sonderland', 0.02), ('majority', 0.019), ('behind', 0.019), ('motivation', 0.019), ('inserted', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 21 jmlr-2005-Combining Information Extraction Systems Using Voting and Stacked Generalization
Author: Georgios Sigletos, Georgios Paliouras, Constantine D. Spyropoulos, Michalis Hatzopoulos
Abstract: This article investigates the effectiveness of voting and stacked generalization -also known as stacking- in the context of information extraction (IE). A new stacking framework is proposed that accommodates well-known approaches for IE. The key idea is to perform cross-validation on the base-level data set, which consists of text documents annotated with relevant information, in order to create a meta-level data set that consists of feature vectors. A classifier is then trained using the new vectors. Therefore, base-level IE systems are combined with a common classifier at the metalevel. Various voting schemes are presented for comparing against stacking in various IE domains. Well known IE systems are employed at the base-level, together with a variety of classifiers at the meta-level. Results show that both voting and stacking work better when relying on probabilistic estimates by the base-level systems. Voting proved to be effective in most domains in the experiments. Stacking, on the other hand, proved to be consistently effective over all domains, doing comparably or better than voting and always better than the best base-level systems. Particular emphasis is also given to explaining the results obtained by voting and stacking at the meta-level, with respect to the varying degree of similarity in the output of the base-level systems. Keywords: stacking, voting, information extraction, cross-validation 1
2 0.027965039 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
Author: Joseph F. Murray, Gordon F. Hughes, Kenneth Kreutz-Delgado
Abstract: We compare machine learning methods applied to a difficult real-world problem: predicting computer hard-drive failure using attributes monitored internally by individual drives. The problem is one of detecting rare events in a time series of noisy and nonparametrically-distributed data. We develop a new algorithm based on the multiple-instance learning framework and the naive Bayesian classifier (mi-NB) which is specifically designed for the low false-alarm case, and is shown to have promising performance. Other methods compared are support vector machines (SVMs), unsupervised clustering, and non-parametric statistical tests (rank-sum and reverse arrangements). The failure-prediction performance of the SVM, rank-sum and mi-NB algorithm is considerably better than the threshold method currently implemented in drives, while maintaining low false alarm rates. Our results suggest that nonparametric statistical tests should be considered for learning problems involving detecting rare events in time series data. An appendix details the calculation of rank-sum significance probabilities in the case of discrete, tied observations, and we give new recommendations about when the exact calculation should be used instead of the commonly-used normal approximation. These normal approximations may be particularly inaccurate for rare event problems like hard drive failures. Keywords: hard drive failure prediction, rank-sum test, support vector machines (SVM), exact nonparametric statistics, multiple instance naive-Bayes
3 0.024181549 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
4 0.024138046 1 jmlr-2005-A Bayes Optimal Approach for Partitioning the Values of Categorical Attributes
Author: Marc Boullé
Abstract: In supervised machine learning, the partitioning of the values (also called grouping) of a categorical attribute aims at constructing a new synthetic attribute which keeps the information of the initial attribute and reduces the number of its values. In this paper, we propose a new grouping method MODL1 founded on a Bayesian approach. The method relies on a model space of grouping models and on a prior distribution defined on this model space. This results in an evaluation criterion of grouping, which is minimal for the most probable grouping given the data, i.e. the Bayes optimal grouping. We propose new super-linear optimization heuristics that yields near-optimal groupings. Extensive comparative experiments demonstrate that the MODL grouping method builds high quality groupings in terms of predictive quality, robustness and small number of groups. Keywords: data preparation, grouping, Bayesianism, model selection, classification, naïve Bayes 1
5 0.024063375 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
6 0.023432879 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
7 0.022211662 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
8 0.021235036 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
9 0.017664047 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
10 0.016274493 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
11 0.015585806 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
12 0.014369037 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
13 0.013889803 72 jmlr-2005-What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks
14 0.013597711 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
15 0.013003636 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
16 0.012115067 36 jmlr-2005-Gaussian Processes for Ordinal Regression
17 0.011630906 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
18 0.010923137 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
19 0.010738623 44 jmlr-2005-Learning Module Networks
20 0.010021875 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
topicId topicWeight
[(0, 0.07), (1, 0.027), (2, 0.032), (3, -0.04), (4, 0.052), (5, -0.03), (6, -0.03), (7, 0.004), (8, 0.031), (9, -0.042), (10, 0.012), (11, -0.07), (12, 0.055), (13, -0.156), (14, 0.007), (15, -0.012), (16, 0.15), (17, -0.103), (18, 0.024), (19, 0.15), (20, 0.163), (21, 0.167), (22, -0.081), (23, 0.188), (24, -0.0), (25, 0.315), (26, 0.429), (27, 0.212), (28, -0.296), (29, -0.062), (30, -0.147), (31, 0.021), (32, -0.489), (33, 0.046), (34, -0.16), (35, -0.019), (36, -0.057), (37, -0.099), (38, -0.147), (39, -0.018), (40, 0.066), (41, -0.135), (42, -0.016), (43, 0.087), (44, -0.024), (45, -0.031), (46, 0.011), (47, 0.038), (48, 0.056), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.99068713 21 jmlr-2005-Combining Information Extraction Systems Using Voting and Stacked Generalization
Author: Georgios Sigletos, Georgios Paliouras, Constantine D. Spyropoulos, Michalis Hatzopoulos
Abstract: This article investigates the effectiveness of voting and stacked generalization -also known as stacking- in the context of information extraction (IE). A new stacking framework is proposed that accommodates well-known approaches for IE. The key idea is to perform cross-validation on the base-level data set, which consists of text documents annotated with relevant information, in order to create a meta-level data set that consists of feature vectors. A classifier is then trained using the new vectors. Therefore, base-level IE systems are combined with a common classifier at the metalevel. Various voting schemes are presented for comparing against stacking in various IE domains. Well known IE systems are employed at the base-level, together with a variety of classifiers at the meta-level. Results show that both voting and stacking work better when relying on probabilistic estimates by the base-level systems. Voting proved to be effective in most domains in the experiments. Stacking, on the other hand, proved to be consistently effective over all domains, doing comparably or better than voting and always better than the best base-level systems. Particular emphasis is also given to explaining the results obtained by voting and stacking at the meta-level, with respect to the varying degree of similarity in the output of the base-level systems. Keywords: stacking, voting, information extraction, cross-validation 1
2 0.088245563 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
3 0.071851477 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
4 0.071745649 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
Author: Neil Lawrence
Abstract: Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be nonlinearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets. Keywords: Gaussian processes, latent variable models, principal component analysis, spectral methods, unsupervised learning, visualisation
5 0.070259713 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
6 0.068968944 12 jmlr-2005-An MDP-Based Recommender System
7 0.065262489 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
8 0.05834873 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
9 0.052586183 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
10 0.050367333 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
11 0.049293783 1 jmlr-2005-A Bayes Optimal Approach for Partitioning the Values of Categorical Attributes
12 0.043760542 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
13 0.042210702 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
14 0.03996167 72 jmlr-2005-What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks
15 0.039075892 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
16 0.038818456 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
17 0.038537502 47 jmlr-2005-Learning from Examples as an Inverse Problem
18 0.036398288 8 jmlr-2005-Active Coevolutionary Learning of Deterministic Finite Automata
19 0.035731371 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
topicId topicWeight
[(17, 0.028), (19, 0.012), (36, 0.02), (37, 0.017), (42, 0.647), (43, 0.039), (52, 0.052), (59, 0.011), (70, 0.02), (88, 0.034), (94, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.91974503 21 jmlr-2005-Combining Information Extraction Systems Using Voting and Stacked Generalization
Author: Georgios Sigletos, Georgios Paliouras, Constantine D. Spyropoulos, Michalis Hatzopoulos
Abstract: This article investigates the effectiveness of voting and stacked generalization -also known as stacking- in the context of information extraction (IE). A new stacking framework is proposed that accommodates well-known approaches for IE. The key idea is to perform cross-validation on the base-level data set, which consists of text documents annotated with relevant information, in order to create a meta-level data set that consists of feature vectors. A classifier is then trained using the new vectors. Therefore, base-level IE systems are combined with a common classifier at the metalevel. Various voting schemes are presented for comparing against stacking in various IE domains. Well known IE systems are employed at the base-level, together with a variety of classifiers at the meta-level. Results show that both voting and stacking work better when relying on probabilistic estimates by the base-level systems. Voting proved to be effective in most domains in the experiments. Stacking, on the other hand, proved to be consistently effective over all domains, doing comparably or better than voting and always better than the best base-level systems. Particular emphasis is also given to explaining the results obtained by voting and stacking at the meta-level, with respect to the varying degree of similarity in the output of the base-level systems. Keywords: stacking, voting, information extraction, cross-validation 1
2 0.84272492 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
Author: Jieping Ye
Abstract: A generalized discriminant analysis based on a new optimization criterion is presented. The criterion extends the optimization criteria of the classical Linear Discriminant Analysis (LDA) when the scatter matrices are singular. An efficient algorithm for the new optimization problem is presented. The solutions to the proposed criterion form a family of algorithms for generalized LDA, which can be characterized in a closed form. We study two specific algorithms, namely Uncorrelated LDA (ULDA) and Orthogonal LDA (OLDA). ULDA was previously proposed for feature extraction and dimension reduction, whereas OLDA is a novel algorithm proposed in this paper. The features in the reduced space of ULDA are uncorrelated, while the discriminant vectors of OLDA are orthogonal to each other. We have conducted a comparative study on a variety of real-world data sets to evaluate ULDA and OLDA in terms of classification accuracy. Keywords: dimension reduction, linear discriminant analysis, uncorrelated LDA, orthogonal LDA, singular value decomposition
4 0.2257721 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Author: Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Suvrit Sra
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces. Keywords: clustering, directional distributions, mixtures, von Mises-Fisher, expectation maximization, maximum likelihood, high dimensional data c 2005 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh and Suvrit Sra. BANERJEE , D HILLON , G HOSH AND S RA
5 0.19480295 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
Author: Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher’s linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods. Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information.
6 0.18610239 54 jmlr-2005-Managing Diversity in Regression Ensembles
7 0.18433055 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
8 0.18116777 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
9 0.17595956 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
10 0.17540149 3 jmlr-2005-A Classification Framework for Anomaly Detection
11 0.17505002 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
12 0.17330046 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
13 0.17030041 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
14 0.16912392 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
15 0.16730621 64 jmlr-2005-Semigroup Kernels on Measures
16 0.16593665 36 jmlr-2005-Gaussian Processes for Ordinal Regression
17 0.16555665 39 jmlr-2005-Information Bottleneck for Gaussian Variables
18 0.16263559 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
19 0.158636 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
20 0.15790911 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching