jmlr jmlr2008 jmlr2008-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
Reference: text
sentIndex sentText sentNum sentScore
1 As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. [sent-13, score-0.351]
2 Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) [sent-14, score-0.282]
3 and surrogate splitting (as in CART) to handle missing values. [sent-15, score-0.391]
4 Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. [sent-16, score-0.672]
5 Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values [sent-17, score-0.391]
6 In computational biology, the task of protein secondary structure prediction is to assign a secondary structure class to each amino acid residue in the protein sequence (Qian and Sejnowski, 1988). [sent-21, score-0.336]
7 To predict label yt , a sliding window method uses features drawn from some “window” of the X sequence. [sent-40, score-0.515]
8 For example, a 5-element window wt (X) would use the features xt−2 , xt−1 , xt , xt+1 , xt+2 . [sent-41, score-0.248]
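As a rough illustration of this sliding-window representation, the following Python sketch builds the 5-element window around position t; the padding symbol and function name are assumptions of this sketch, not details from the paper.

    def window_features(X, t, half_width=2, pad="<pad>"):
        """Collect x_{t-2}, ..., x_{t+2} around position t, padding past the ends.

        X is a list of per-position observations; the pad symbol is an assumption
        made for this sketch (the paper does not specify boundary handling).
        """
        window = []
        for offset in range(-half_width, half_width + 1):
            i = t + offset
            window.append(X[i] if 0 <= i < len(X) else pad)
        return window

    # Example: a 5-element window centred on position 0 of a short sequence.
    print(window_features(["M", "K", "V", "L", "A"], t=0))
    # ['<pad>', '<pad>', 'M', 'K', 'V']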
9 However, in most SSL problems, there are correlations among successive class labels yt . [sent-43, score-0.268]
10 The best-known method for capturing the yt−1 ↔ yt correlation is the hidden Markov model (HMM) (Rabiner, 1989), which is a generative model of P(X,Y ), the joint distribution of the observation sequence and label sequence. [sent-47, score-0.323]
11 This assumption of independence of each input feature xt, j conditioned on yt makes HMMs unable to model arbitrary, non-independent input features, and this limits the accuracy and “engineerability” of HMMs. [sent-49, score-0.349]
12 MEMMs are directed graphical models of the form P(Y |X) = ∏t P(yt |yt−1 , wt (X)), where wt (X) is a sliding window over the X sequence. [sent-55, score-0.269]
13 Conditional random fields are undirected models of the form P(Y |X) = 1/Z(X) exp ∑t Ψ(yt , yt−1 , wt (X)), where Z(X) is a global normalizing term and Ψ(yt , yt−1 , wt (X)) is a potential function that scores the compatibility of yt , yt−1 , and wt (X). [sent-57, score-0.468]
14 In this paper, we introduce a different approach for training the potential functions based on Friedman’s gradient tree boosting algorithm (Friedman, 2001). [sent-89, score-0.308]
15 Each regression tree can be viewed as defining several new feature combinations—one corresponding to each path in the tree from the root to a leaf. [sent-93, score-0.247]
16 Another advantage of tree boosting is that it is able to handle missing values in the inputs using clever methods specific to regression trees, such as the instance weighting method of C4.5. [sent-95, score-0.521]
17 This paper describes the gradient tree boosting algorithm including methods for incorporating weight penalties into the procedure. [sent-100, score-0.23]
18 We also perform experiments to evaluate the effectiveness of four methods for handling missing values (instance weighting, surrogate splits, indicator features, and imputation). [sent-103, score-0.398]
19 The results show that instance weighting works best, but that imputation also works surprisingly well. [sent-104, score-0.29]
20 First, for CRF models, instance weighting combined with gradient tree boosting can be recommended as a good algorithm for learning in the presence of missing values. [sent-106, score-0.561]
21 Second, for all SSL methods, imputation can be employed to provide a reasonable missing values method. [sent-107, score-0.371]
22 Y = (y1, . . . , yT) is the sequence of labels, where yt ∈ {1, . . . , K}. [sent-115, score-0.293]
23 The exponential function ensures that P(Y |X) is positive, and the normalizing constant Z(X) = ∑Y exp[∑t Ψt (yt , X) + Ψt−1,t (yt−1 , yt , X)] ensures that P(Y |X) sums to 1. [sent-122, score-0.268]
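To make the normalization concrete, here is a small self-contained Python sketch (ours, with randomly generated stand-in potential tables psi_node and psi_edge) that computes log Z(X) both by enumerating all label sequences and by the standard forward recursion; the two values should agree.

    import itertools
    import math
    import random

    def log_Z_brute_force(T, K, psi_node, psi_edge):
        """log Z(X), with Z(X) = sum over all label sequences of
        exp(sum_t [Psi_t(y_t) + Psi_{t-1,t}(y_{t-1}, y_t)]).

        psi_node[t][y] and psi_edge[t][y_prev][y] are hypothetical potential tables
        standing in for Psi_t(yt, X) and Psi_{t-1,t}(yt-1, yt, X)."""
        total = 0.0
        for labels in itertools.product(range(K), repeat=T):
            score = psi_node[0][labels[0]]
            for t in range(1, T):
                score += psi_node[t][labels[t]] + psi_edge[t][labels[t - 1]][labels[t]]
            total += math.exp(score)
        return math.log(total)

    def log_Z_forward(T, K, psi_node, psi_edge):
        """Forward recursion: alpha_t(y) = sum_{y'} alpha_{t-1}(y') * exp(node + edge)."""
        alpha = [math.exp(psi_node[0][y]) for y in range(K)]
        for t in range(1, T):
            alpha = [sum(alpha[yp] * math.exp(psi_node[t][y] + psi_edge[t][yp][y])
                         for yp in range(K)) for y in range(K)]
        return math.log(sum(alpha))

    # Tiny random example: both computations of log Z(X) should match.
    random.seed(0)
    T, K = 4, 3
    psi_node = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(T)]
    psi_edge = [[[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)] for _ in range(T)]
    print(round(log_Z_brute_force(T, K, psi_node, psi_edge), 6),
          round(log_Z_forward(T, K, psi_node, psi_edge), 6))

Brute-force enumeration is exponential in T and is shown only to check the forward recursion, which is what makes training and prediction tractable for linear-chain models.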
24 (2001) studied potential functions that are weighted combinations of binary features: Ψt(yt, X) = ∑_a βa ga(yt, X) and Ψt−1,t(yt−1, yt, X) = ∑_b λb fb(yt−1, yt, X), where the βa's and λb's are trainable weights, and the features ga and fb are boolean functions. [sent-127, score-0.666]
25 In part-of-speech tagging, for example, g234 (yt , X) might be 1 when xt is the word “bank” and yt is the class “noun” (and 0 otherwise). [sent-128, score-0.348]
26 As with sliding window methods, it is natural to define features that depend only on a sliding window wt (X) of X values. [sent-129, score-0.43]
27 For example, in protein secondary structure prediction, Qian and Sejnowski (1988) found that a 13-residue sliding window gave best results for neural network methods. [sent-142, score-0.269]
28 When feature induction is turned on, the learner starts with a single constant feature and (every 8 iterations) introduces new feature conjunctions by taking conjunctions of the basic features with features already in the model. [sent-147, score-0.375]
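The conjunction-induction idea can be sketched schematically as follows; this is not McCallum's actual implementation, and the gain scoring function, model_features, and basic_features names are hypothetical placeholders introduced only for illustration.

    def propose_conjunctions(model_features, basic_features, gain, top_k=10):
        """Form candidate conjunctions of basic features with features already in
        the model, and keep the top_k by some gain criterion.

        A model feature is represented as a frozenset of basic tests; `gain` is a
        hypothetical callable scoring a candidate (in the real algorithm it would
        measure improvement in the training objective)."""
        candidates = []
        for f in model_features:
            for g in basic_features:
                if g not in f:                  # skip trivial self-conjunctions
                    candidates.append(f | {g})
        candidates = [c for c in candidates if c not in model_features]
        candidates.sort(key=gain, reverse=True)
        return candidates[:top_k]

    # One round of induction starting from the constant feature (empty conjunction),
    # with a trivial stand-in gain function:
    print(propose_conjunctions([frozenset()], ["word=bank", "tag=noun"], gain=len))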
29 In this paper, we employ the gradient tree boosting method (Friedman, 2001) to construct complex features from the basic features as part of a stage-wise construction of the potential functions. [sent-150, score-0.382]
30 Gradient tree boosting is based on the idea of functional gradient ascent. [sent-161, score-0.259]
31 We do have a set of training examples sampled from this joint distribution, so we can compute the value of the functional gradient at each of our training data points: ∆m(yi, xi) = ∇Ψ ∑_i log P(yi | xi; Ψ), evaluated at Ψ = Ψm−1. [sent-175, score-0.24]
32 We can then use these point-wise functional gradients to define a set of functional gradient training examples, ((xi, yi), ∆m(yi, xi)), and then train a function hm(y, x) so that it approximates ∆m(yi, xi). [sent-176, score-0.258]
33 All we need to do is to represent and train Ψ(yt , X) and Ψ(yt−1 , yt , X) as weighted sums of regression trees. [sent-207, score-0.301]
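As a rough sketch of this stage-wise procedure (not the paper's implementation), the following uses scikit-learn's DecisionTreeRegressor as the regression-tree learner, assuming scikit-learn is available; the names functional_gradient, examples, n_stages, and max_leaves are our own.

    from sklearn.tree import DecisionTreeRegressor

    def gradient_tree_boost(examples, functional_gradient, n_stages=50, max_leaves=30):
        """Stage-wise fitting of a potential function as a sum of regression trees.

        `examples` is a list of numeric feature vectors; `functional_gradient(trees, x)`
        is assumed to return the point-wise functional gradient for example x given
        the current ensemble (both names are ours, not the paper's)."""
        trees = []
        for m in range(n_stages):
            # Functional-gradient "residuals" under the current ensemble.
            targets = [functional_gradient(trees, x) for x in examples]
            h = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
            h.fit(examples, targets)
            trees.append(h)            # Psi_m = Psi_{m-1} + h_m
        return trees

    def ensemble_value(trees, x):
        """Evaluate the learned potential: the sum of all regression trees at x."""
        return sum(float(t.predict([x])[0]) for t in trees)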
34 Let F^yt(yt−1, X) = Ψ(yt, X) + Ψ(yt−1, yt, X) be a function that computes the “desirability” of label yt given values for label yt−1 and the input features X. [sent-208, score-0.924]
35 With this definition, the CRF has the form P(Y|X) = (1/Z(X)) exp ∑_t F^yt(yt−1, X). [sent-210, score-0.268]
36 We now compute the functional gradient of log P(Y|X) with respect to F^yt(yt−1, X). [sent-211, score-0.37]
37 We will assume that yt takes the value ⊥ for t < 1. [sent-216, score-0.268]
38 The second way to make predictions is to individually predict each yt according to Ht(X) = argmax_v P(yt = v | X), and then concatenate these individual predictions to obtain H(X). [sent-274, score-0.268]
39 This makes sense in applications where the goal is to maximize the number of individual yt ’s correctly predicted, even if the resulting predicted sequence Y is incoherent. [sent-275, score-0.317]
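The two prediction modes can be illustrated with the sketch below (ours), which works over a hypothetical score table F[t][y_prev][y] standing in for F^yt(yt−1, X); the marginals needed for per-position prediction are computed here by brute force purely for clarity (a real implementation would use forward-backward).

    import itertools
    import math

    def viterbi_decode(T, K, F):
        """Most likely label sequence: argmax_Y sum_t F[t][y_{t-1}][y_t].
        F[0][0][y] is used at the first position (the previous label is a dummy)."""
        delta = [F[0][0][y] for y in range(K)]
        back = []
        for t in range(1, T):
            new_delta, ptr = [], []
            for y in range(K):
                scores = [delta[yp] + F[t][yp][y] for yp in range(K)]
                best = max(range(K), key=lambda yp: scores[yp])
                new_delta.append(scores[best])
                ptr.append(best)
            delta, back = new_delta, back + [ptr]
        y = max(range(K), key=lambda v: delta[v])
        path = [y]
        for ptr in reversed(back):
            y = ptr[y]
            path.append(y)
        return list(reversed(path))

    def per_position_decode(T, K, F):
        """Predict each y_t separately as argmax_v P(y_t = v | X), computed here by
        brute-force enumeration of all label sequences (fine only for tiny T, K)."""
        marg = [[0.0] * K for _ in range(T)]
        for labels in itertools.product(range(K), repeat=T):
            score = F[0][0][labels[0]]
            for t in range(1, T):
                score += F[t][labels[t - 1]][labels[t]]
            w = math.exp(score)
            for t in range(T):
                marg[t][labels[t]] += w
        return [max(range(K), key=lambda v: marg[t][v]) for t in range(T)]

On small problems the two decoders can disagree: the Viterbi path maximizes whole-sequence probability, while the per-position decoder maximizes the expected number of correctly labeled positions.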
40 The values of input features can be missing for a wide variety of reasons. [sent-282, score-0.266]
41 (2006) used a feature bagging method to deal with SSL problems where highly indicative features may be missing in the test data. [sent-287, score-0.314]
42 There exist very good methods for handling missing values when growing regression trees, which include the instance weighting method of C4.5. [sent-292, score-0.364]
43 An advantage of training CRFs with gradient tree boosting is that these missing values methods can be used directly in the process of generating regression trees over the functional gradient training examples. [sent-295, score-0.695]
44 Instance Weighting: The instance weighting method (Quinlan, 1993), also known as “proportional distribution”, assigns a weight to each training example, and all splitting decisions are based on weighted statistics. [sent-297, score-0.244]
45 The best feature xj∗ is chosen, and the training examples for which xj∗ is not missing are sent to the appropriate child node. [sent-301, score-0.335]
46 Instance weighting assumes that the training and test examples missing xj∗ will on average behave exactly like the training examples for which xj∗ is not missing. [sent-312, score-0.481]
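A minimal sketch of the proportional-distribution rule (ours, not the paper's code): an example whose split feature is missing is sent to both children, with its weight scaled by each child's share of the weighted non-missing examples.

    def distribute(examples, split_feature, threshold):
        """Split weighted examples on one feature, sending missing-valued examples
        to both children with proportionally reduced weights.

        Each example is a (features_dict, weight) pair; this is a sketch of the
        C4.5-style rule, not the paper's exact implementation."""
        left, right, missing = [], [], []
        for features, w in examples:
            v = features.get(split_feature)
            if v is None:
                missing.append((features, w))
            elif v <= threshold:
                left.append((features, w))
            else:
                right.append((features, w))
        n_left = sum(w for _, w in left)
        n_right = sum(w for _, w in right)
        total = n_left + n_right
        for features, w in missing:
            left.append((features, w * n_left / total))
            right.append((features, w * n_right / total))
        return left, right

    # Example: the third instance is missing 'x1', so it goes both ways.
    ex = [({"x1": 0.2}, 1.0), ({"x1": 0.9}, 1.0), ({}, 1.0)]
    l, r = distribute(ex, "x1", threshold=0.5)
    print([w for _, w in l], [w for _, w in r])   # [1.0, 0.5] [1.0, 0.5]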
47 Hence, each xt consists of a line in the FAQ file, and the corresponding yt ∈ {header, question, answer, tail}. [sent-409, score-0.317]
48 In TreeCRF, the cost of feature induction is the cost of growing a regression tree, which depends on the number of basic features n and the number of internal nodes in the tree L. [sent-552, score-0.285]
49 For each sequence length, we generated a training data set with 100 sequences and employed a sliding window of size 3. [sent-568, score-0.259]
50 We tried sliding window sizes of 3, 5, 7, 9 and 11, so that the number of input features at each sequence position takes the values of 75, 125, 175, 225 and 275 (because each input observation is represented by 25 boolean indicator variables). [sent-574, score-0.36]
51 In addition to the instance weighting and surrogate splitting methods described above, we also studied two simpler methods: imputation and indicator features. [sent-584, score-0.555]
52 Indicator Features: a boolean feature x̃tj is introduced for each feature xtj such that if xtj is present, x̃tj is false. [sent-596, score-0.33]
53 Indicator features make sense when the fact that a value is missing is itself informative. [sent-598, score-0.266]
54 For example, if xt j represents a temperature reading, it may be that extremely cold temperature values tend to be missing because of sensor failure. [sent-599, score-0.255]
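A minimal sketch of the indicator-feature transformation (ours); filling the original value with 0.0 when it is missing is an assumption of this sketch, not something the paper specifies.

    def add_missing_indicators(features):
        """Augment a feature dict with one boolean indicator per feature that is
        true exactly when the original value is missing (a sketch of the
        'indicator features' method; naming conventions are ours)."""
        augmented = {}
        for name, value in features.items():
            missing = value is None
            augmented[name] = 0.0 if missing else value
            augmented[name + "_missing"] = 1.0 if missing else 0.0
        return augmented

    print(add_missing_indicators({"temperature": None, "pressure": 1.3}))
    # {'temperature': 0.0, 'temperature_missing': 1.0, 'pressure': 1.3, 'pressure_missing': 0.0}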
55 For each learning problem, we took the chosen training and test sets and injected missing values at rates of 5%, 10%, 20% and 40%. [sent-602, score-0.252]
56 For a given missing rate, we generate five versions of the training set and five versions of the test set. [sent-603, score-0.252]
57 A CRF is then trained on each of the training sets and evaluated on each of the test sets (for a total of 5 CRFs and 25 evaluations per missing rate). [sent-604, score-0.252]
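The corruption protocol can be sketched as follows (our own illustration, assuming missing-completely-at-random corruption at the level of individual feature values; as noted later, some data sets were corrupted at the residue or letter level instead).

    import random

    def inject_missing(data, rate, seed):
        """Return a copy of `data` (a list of feature dicts) in which each feature
        value is independently replaced by None with probability `rate`."""
        rng = random.Random(seed)
        corrupted = []
        for features in data:
            corrupted.append({k: (None if rng.random() < rate else v)
                              for k, v in features.items()})
        return corrupted

    # Five corrupted versions of a (toy) training set at a 20% missing rate.
    train = [{"x1": 0.3, "x2": 1.0}, {"x1": 0.7, "x2": 0.0}]
    versions = [inject_missing(train, rate=0.20, seed=s) for s in range(5)]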
58 (i.e., we compute ŷt = argmax_yt P(yt | X) for each t separately). [sent-607, score-0.268]
59 [Figure 4 plot residue: legend entries “instance weighting”, “surrogate splitting”, “imputation”, “indicator feature”; y-axis “prediction accuracy (%)”.]
60 [Figure 4 plot residue: x-axis “missing rate (%)” with ticks at 5, 10, 20, 40; panels include (b) NETtalk, (c) Hyphen, (d) FAQ ai-general.]
61 Figure 4: Performance of missing values methods for different missing rates. [sent-621]
62 where m1 , m2 , and m3 are boolean indicator variables that specify which missing values method we are using and the S ’s are indicator variables that specify which of the five training sets we are using. [sent-622, score-0.45]
63 If m1 = 1, this indicates surrogate splitting, m2 = 1 indicates imputation, and m3 = 1 indicates the indicator feature method. [sent-624, score-0.24]
64 We can then test the hypothesis δi ≠ 0 against the null hypothesis δi = 0 to determine whether missing values method i is different from the baseline method. [sent-626, score-0.28]
65 Figure 4a shows that instance weighting achieves the best prediction accuracy for each of the different missing rates. [sent-629, score-0.387]
66 Table 7a shows that the baseline missing values method, instance weighting, is statistically better than the other three missing values methods in most cases. [sent-630, score-0.482]
67 In the FAQ ai-general problem, imputation was the baseline method, so the coefficient values give the log odds of the change in accuracy relative to imputation. [sent-681, score-0.272]
68 In Figure 4b, we see that instance weighting does better than the other three missing values methods for all the different missing rates. [sent-685, score-0.537]
69 The statistical tests reported in Table 7b show that the baseline method, instance weighting, is statistically better than each of the other missing value methods in all cases. [sent-686, score-0.35]
70 Figure 4c shows that instance weighting is the best missing values method except for a missing rate of 5%. [sent-688, score-0.537]
71 Statistical tests shown in Table 7c tell us that for a missing rate of 5%, surrogate splitting is the best missing values method and the other three methods are not statistically significantly different from each other. [sent-689, score-0.645]
72 For a missing rate of 10%, instance weighting and imputation are statistically better than the other two methods (and indistinguishable from each other). [sent-690, score-0.544]
73 For missing rates of 20% and 40%, instance weighting is statistically better than the other three methods. [sent-691, score-0.379]
74 Unlike in the previous data sets, instance weighting is no longer the best missing values method, as shown in Figure 4d. [sent-696, score-0.331]
75 Instead, imputation performs very well for various missing value rates. [sent-697, score-0.371]
76 Table 7d shows that imputation is statistically the best missing values method. [sent-698, score-0.419]
77 For missing rates of 10% and 40%, it is statistically better than the other three methods. [sent-699, score-0.254]
78 For a missing rate of 5%, it does as well as instance weighting and surrogate splitting. [sent-700, score-0.443]
79 For a missing rate of 20%, it does as well as surrogate splitting and indicator features. [sent-701, score-0.471]
80 Hence, in the protein data set, missing values are introduced by choosing an amino acid residue position in the observation sequence and setting all of its boolean indicator features to missing. [sent-706, score-0.349]
81 Similarly, in the NETtalk and hyphenation problems, a letter is made to be missing by setting all 26 indicator features for that letter to missing. [sent-713, score-0.448]
82 Similarly, imputation is computed at the amino acid or letter level, not at the level of boolean features. [sent-714, score-0.296]
83 However, in the Usenet FAQ data set, since the binary features are not exclusive, imputation is computed at the level of boolean features. [sent-715, score-0.263]
84 In the case of protein sequences, imputation will replace missing values with the most frequently-occurring amino acid, which is alanine, code ‘A’. [sent-716, score-0.483]
85 Alanine tends to form alpha helices, so this may cause the learning algorithms to over-predict the helix class, which may explain why imputation performed worst on the protein data set. [sent-717, score-0.244]
86 In the case of English words, the most common letter is ‘E’, and it does not carry much information either about pronunciation or about hyphenation, so this may explain why imputation worked well in the NETtalk and hyphenation problems. [sent-718, score-0.238]
87 Hence, in most cases, imputation with the most common feature value will supply the correct missing value. [sent-720, score-0.419]
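A minimal sketch of this most-frequent-value imputation (ours); column names and the use of None for missing entries are assumptions of the sketch.

    from collections import Counter

    def impute_most_frequent(columns):
        """Replace missing entries (None) in each column by that column's most
        frequent observed value -- a sketch of the imputation baseline (for protein
        sequences this ends up being the most common amino acid, e.g. 'A')."""
        imputed = {}
        for name, values in columns.items():
            observed = [v for v in values if v is not None]
            mode = Counter(observed).most_common(1)[0][0] if observed else None
            imputed[name] = [mode if v is None else v for v in values]
        return imputed

    print(impute_most_frequent({"residue": ["A", None, "G", "A", None]}))
    # {'residue': ['A', 'A', 'G', 'A', 'A']}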
88 (e.g., in medicine, a feature could be missing because a physician explicitly chose not to measure it). [sent-724, score-0.254]
89 Because features were marked as missing completely at random, this is not true, so the indicator feature carries no positive information about the class label. [sent-725, score-0.394]
90 However, in cases where imputation causes problems, the indicator feature approach may help prevent those problems by being more neutral. [sent-726, score-0.293]
91 This may explain why the indicator feature method works slightly better in most cases than the imputation method. [sent-728, score-0.293]
92 2135 D IETTERICH , H AO AND A SHENFELTER The surrogate splitting method assumes that the input features are correlated with one another, so that if one feature is missing, its value can be computed from another feature. [sent-729, score-0.293]
93 In the experiment, each of these 20 features could be independently marked as missing, which is a bit unrealistic, since presumably the real missing values would involve some loss or corruption of the words making up the line, and this would affect multiple features. [sent-737, score-0.266]
94 The 20 features do have some redundancy, so we would expect that surrogate splitting should work well, and it does for 5% and 20% missing rates. [sent-738, score-0.451]
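A rough sketch of CART-style surrogate selection (ours, not the paper's implementation): among candidate features, pick the split that best agrees with the primary split on examples where both features are observed.

    def best_surrogate(examples, primary, candidates):
        """Pick the candidate split that best reproduces the primary split's
        left/right assignment. `primary` and each candidate are (feature_name,
        threshold) pairs; examples are feature dicts. Ties and CART's
        majority-direction fallback are ignored here for brevity."""
        p_name, p_thr = primary
        best, best_agree = None, -1.0
        for c_name, c_thr in candidates:
            pairs = [(f[p_name] <= p_thr, f[c_name] <= c_thr)
                     for f in examples
                     if f.get(p_name) is not None and f.get(c_name) is not None]
            if not pairs:
                continue
            agree = sum(a == b for a, b in pairs) / len(pairs)
            # A surrogate may also be used 'reversed'; take the better orientation.
            agree = max(agree, 1.0 - agree)
            if agree > best_agree:
                best, best_agree = (c_name, c_thr), agree
        return best, best_agree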
95 The instance weighting method assumes that the feature values are missing at random and that the other features provide no redundant information, so the most sensible thing to do is to marginalize away the uncertainty about the missing values. [sent-739, score-0.645]
96 TreeCRF also provides the additional ability to handle missing data with the instance weighting and surrogate splitting methods, which are not available in MALLET and other CRF training algorithms. [sent-752, score-0.562]
97 The experiments suggest that when the feature values are missing at random, the instance weighting approach works very well. [sent-753, score-0.379]
98 In the one domain where instance weighting did not work well, imputation was the best method. [sent-754, score-0.29]
99 The good performance of the indicator features and imputation methods is encouraging, because these methods can be applied with all known methods for sequential supervised learning, not only with gradient tree boosting. [sent-758, score-0.461]
100 Since there is no one best method for handling missing values, as with many other aspects of machine learning, preliminary experiments on subsets of the training data are required to select the most appropriate method. [sent-759, score-0.252]
wordName wordTfidf (topN-words)
[('ree', 0.468), ('allet', 0.37), ('crf', 0.343), ('yt', 0.268), ('faq', 0.234), ('missing', 0.206), ('imputation', 0.165), ('nettalk', 0.138), ('surrogate', 0.112), ('yd', 0.103), ('weighting', 0.103), ('hyphen', 0.094), ('ietterich', 0.094), ('shenfelter', 0.094), ('crfs', 0.091), ('mallet', 0.087), ('ssl', 0.087), ('wd', 0.086), ('tree', 0.083), ('window', 0.083), ('ields', 0.08), ('onditional', 0.08), ('raining', 0.08), ('indicator', 0.08), ('mccallum', 0.079), ('protein', 0.079), ('boosting', 0.074), ('baseline', 0.074), ('sliding', 0.074), ('cpu', 0.074), ('gradient', 0.073), ('splitting', 0.073), ('radient', 0.072), ('oosting', 0.072), ('ao', 0.066), ('features', 0.06), ('nodequeue', 0.058), ('andom', 0.057), ('wt', 0.056), ('aix', 0.051), ('xt', 0.049), ('statistically', 0.048), ('feature', 0.048), ('training', 0.046), ('hyphenation', 0.044), ('nright', 0.044), ('splitfeature', 0.044), ('cumulative', 0.038), ('boolean', 0.038), ('induction', 0.037), ('conjunctions', 0.037), ('seconds', 0.036), ('nle', 0.036), ('plitattribute', 0.036), ('splitscore', 0.036), ('cart', 0.035), ('sent', 0.035), ('ascent', 0.035), ('hm', 0.035), ('amino', 0.033), ('regression', 0.033), ('accuracy', 0.033), ('secondary', 0.033), ('potential', 0.032), ('trees', 0.032), ('shrinkage', 0.032), ('andrew', 0.032), ('sequences', 0.031), ('word', 0.031), ('acid', 0.031), ('est', 0.031), ('ind', 0.031), ('label', 0.03), ('viterbi', 0.03), ('english', 0.03), ('iterations', 0.03), ('falsedata', 0.029), ('mcnemar', 0.029), ('numleaves', 0.029), ('qian', 0.029), ('quared', 0.029), ('treecrf', 0.029), ('truedata', 0.029), ('veb', 0.029), ('functional', 0.029), ('letter', 0.029), ('tagging', 0.028), ('sejnowski', 0.026), ('stress', 0.026), ('spent', 0.026), ('sequence', 0.025), ('predicted', 0.024), ('internal', 0.024), ('les', 0.024), ('xi', 0.023), ('prediction', 0.023), ('instance', 0.022), ('elds', 0.022), ('enerate', 0.022), ('newsgroup', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
2 0.15268147 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
3 0.11412671 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
4 0.08171311 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
5 0.062918536 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett
Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col
6 0.061896104 45 jmlr-2008-JNCC2: The Java Implementation Of Naive Credal Classifier 2 (Machine Learning Open Source Software Paper)
7 0.058981113 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.056443024 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
9 0.053338379 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
10 0.050300039 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
11 0.050112016 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
12 0.046805419 54 jmlr-2008-Learning to Select Features using their Properties
13 0.045604307 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
14 0.041514955 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
15 0.037210837 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
16 0.035941146 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
17 0.03464257 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
18 0.034170199 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
19 0.03165153 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
topicId topicWeight
[(0, 0.181), (1, -0.018), (2, 0.165), (3, 0.042), (4, -0.258), (5, 0.123), (6, 0.084), (7, 0.055), (8, -0.103), (9, -0.102), (10, -0.092), (11, -0.113), (12, -0.001), (13, -0.047), (14, -0.035), (15, -0.103), (16, -0.253), (17, 0.105), (18, 0.0), (19, 0.024), (20, -0.019), (21, 0.074), (22, -0.016), (23, -0.036), (24, 0.293), (25, 0.046), (26, 0.088), (27, -0.026), (28, 0.172), (29, 0.019), (30, -0.097), (31, -0.075), (32, -0.036), (33, 0.125), (34, -0.08), (35, -0.056), (36, 0.124), (37, -0.067), (38, -0.215), (39, 0.091), (40, 0.049), (41, -0.036), (42, 0.041), (43, -0.005), (44, -0.212), (45, -0.13), (46, -0.057), (47, -0.055), (48, 0.025), (49, -0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.94199848 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
2 0.56569749 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
3 0.52714467 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
4 0.37532678 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
Author: Giorgio Corani, Marco Zaffalon
Abstract: JNCC2 implements the naive credal classifier 2 (NCC2). This is an extension of naive Bayes to imprecise probabilities that aims at delivering robust classifications also when dealing with small or incomplete data sets. Robustness is achieved by delivering set-valued classifications (that is, returning multiple classes) on the instances for which (i) the learning set is not informative enough to smooth the effect of choice of the prior density or (ii) the uncertainty arising from missing data prevents the reliable indication of a single class. JNCC2 is released under the GNU GPL license. Keywords: imprecise probabilities, missing data, naive Bayes, naive credal classifier 2, Java
6 0.23081382 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
7 0.22083692 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.21849447 54 jmlr-2008-Learning to Select Features using their Properties
9 0.21336268 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
10 0.19722679 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
11 0.18758793 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
12 0.17562908 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
13 0.17469595 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
14 0.17034744 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
15 0.1693269 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
16 0.16768365 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
17 0.13762209 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
18 0.13401249 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
19 0.12821904 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
20 0.12622632 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
topicId topicWeight
[(0, 0.016), (5, 0.026), (10, 0.011), (24, 0.316), (31, 0.019), (40, 0.044), (54, 0.064), (58, 0.034), (66, 0.051), (69, 0.012), (72, 0.022), (76, 0.024), (79, 0.011), (82, 0.038), (88, 0.061), (91, 0.025), (92, 0.033), (94, 0.06), (99, 0.022)]
simIndex simValue paperId paperTitle
1 0.87436575 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
Author: Liviu Panait, Karl Tuyls, Sean Luke
Abstract: This paper presents the dynamics of multiple learning agents from an evolutionary game theoretic perspective. We provide replicator dynamics models for cooperative coevolutionary algorithms and for traditional multiagent Q-learning, and we extend these differential equations to account for lenient learners: agents that forgive possible mismatched teammate actions that resulted in low rewards. We use these extended formal models to study the convergence guarantees for these algorithms, and also to visualize the basins of attraction to optimal and suboptimal solutions in two benchmark coordination problems. The paper demonstrates that lenience provides learners with more accurate information about the benefits of performing their actions, resulting in higher likelihood of convergence to the globally optimal solution. In addition, the analysis indicates that the choice of learning algorithm has an insignificant impact on the overall performance of multiagent learning algorithms; rather, the performance of these algorithms depends primarily on the level of lenience that the agents exhibit to one another. Finally, the research herein supports the strength and generality of evolutionary game theory as a backbone for multiagent learning. Keywords: multiagent learning, reinforcement learning, cooperative coevolution, evolutionary game theory, formal models, visualization, basins of attraction
same-paper 2 0.73864275 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
3 0.3895838 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
Author: Faustino Gomez, Jürgen Schmidhuber, Risto Miikkulainen
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks. Keywords: coevolution, recurrent neural networks, non-linear control, genetic algorithms, experimental comparison
4 0.38894618 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
Author: Salvador García, Francisco Herrera
Abstract: In a recently published paper in JMLR, Demˇar (2006) recommends a set of non-parametric stas tistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing n × n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures. Keywords: statistical methods, non-parametric test, multiple comparisons tests, adjusted p-values, logically related hypotheses
6 0.37408808 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
7 0.36827573 83 jmlr-2008-Robust Submodular Observation Selection
8 0.3682245 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
9 0.367134 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
10 0.36485925 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
11 0.36216903 86 jmlr-2008-SimpleMKL
12 0.36210412 58 jmlr-2008-Max-margin Classification of Data with Absent Features
13 0.36116943 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
14 0.35804665 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
15 0.35794866 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
16 0.35589358 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
17 0.35459393 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
18 0.35381538 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
19 0.35360411 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
20 0.35182726 57 jmlr-2008-Manifold Learning: The Price of Normalization