jmlr jmlr2008 jmlr2008-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
Reference: text
sentIndex sentText sentNum sentScore
1 As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. [sent-13, score-0.351]
2 Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) [sent-14, score-0.282]
3 and surrogate splitting (as in CART) to handle missing values. [sent-15, score-0.391]
4 Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. [sent-16, score-0.672]
5 Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values [sent-17, score-0.391]
6 In computational biology, the task of protein secondary structure prediction is to assign a secondary structure class to each amino acid residue in the protein sequence (Qian and Sejnowski, 1988). [sent-21, score-0.336]
7 To predict label yt , a sliding window method uses features drawn from some “window” of the X sequence. [sent-40, score-0.515]
8 For example, a 5-element window wt (X) would use the features xt−2 , xt−1 , xt , xt+1 , xt+2 . [sent-41, score-0.248]
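As a rough illustration of this sliding-window representation, the following Python sketch builds the 5-element window around position t; the padding symbol and function name are assumptions of this sketch, not details from the paper.

    def window_features(X, t, half_width=2, pad="<pad>"):
        """Collect x_{t-2}, ..., x_{t+2} around position t, padding past the ends.

        X is a list of per-position observations; the pad symbol is an assumption
        made for this sketch (the paper does not specify boundary handling).
        """
        window = []
        for offset in range(-half_width, half_width + 1):
            i = t + offset
            window.append(X[i] if 0 <= i < len(X) else pad)
        return window

    # Example: a 5-element window centred on position 0 of a short sequence.
    print(window_features(["M", "K", "V", "L", "A"], t=0))
    # ['<pad>', '<pad>', 'M', 'K', 'V']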
9 However, in most SSL problems, there are correlations among successive class labels yt . [sent-43, score-0.268]
10 The best-known method for capturing the yt−1 ↔ yt correlation is the hidden Markov model (HMM) (Rabiner, 1989), which is a generative model of P(X,Y ), the joint distribution of the observation sequence and label sequence. [sent-47, score-0.323]
11 This assumption of independence of each input feature xt, j conditioned on yt makes HMMs unable to model arbitrary, non-independent input features, and this limits the accuracy and “engineerability” of HMMs. [sent-49, score-0.349]
12 MEMMs are directed graphical models of the form P(Y |X) = ∏t P(yt |yt−1 , wt (X)), where wt (X) is a sliding window over the X sequence. [sent-55, score-0.269]
13 Conditional random fields are undirected models of the form P(Y |X) = 1/Z(X) exp ∑t Ψ(yt , yt−1 , wt (X)), where Z(X) is a global normalizing term and Ψ(yt , yt−1 , wt (X)) is a potential function that scores the compatibility of yt , yt−1 , and wt (X). [sent-57, score-0.468]
14 In this paper, we introduce a different approach for training the potential functions based on Friedman’s gradient tree boosting algorithm (Friedman, 2001). [sent-89, score-0.308]
15 Each regression tree can be viewed as defining several new feature combinations—one corresponding to each path in the tree from the root to a leaf. [sent-93, score-0.247]
16 Another advantage of tree boosting is that it is able to handle missing values in the inputs using clever methods specific to regression trees, such as the instance weighting method of C4.5. [sent-95, score-0.521]
17 This paper describes the gradient tree boosting algorithm including methods for incorporating weight penalties into the procedure. [sent-100, score-0.23]
18 We also perform experiments to evaluate the effectiveness of four methods for handling missing values (instance weighting, surrogate splits, indicator features, and imputation). [sent-103, score-0.398]
19 The results show that instance weighting works best, but that imputation also works surprisingly well. [sent-104, score-0.29]
20 First, for CRF models, instance weighting combined with gradient tree boosting can be recommended as a good algorithm for learning in the presence of missing values. [sent-106, score-0.561]
21 Second, for all SSL methods, imputation can be employed to provide a reasonable missing values method. [sent-107, score-0.371]
22 Y = (y1, . . . , yT) is the sequence of labels, where yt ∈ {1, . . . , K}. [sent-115, score-0.293]
23 The exponential function ensures that P(Y |X) is positive, and the normalizing constant Z(X) = ∑Y exp[∑t Ψt (yt , X) + Ψt−1,t (yt−1 , yt , X)] ensures that P(Y |X) sums to 1. [sent-122, score-0.268]
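To make the normalization concrete, here is a small self-contained Python sketch (ours, with randomly generated stand-in potential tables psi_node and psi_edge) that computes log Z(X) both by enumerating all label sequences and by the standard forward recursion; the two values should agree.

    import itertools
    import math
    import random

    def log_Z_brute_force(T, K, psi_node, psi_edge):
        """log Z(X), with Z(X) = sum over all label sequences of
        exp(sum_t [Psi_t(y_t) + Psi_{t-1,t}(y_{t-1}, y_t)]).

        psi_node[t][y] and psi_edge[t][y_prev][y] are hypothetical potential tables
        standing in for Psi_t(yt, X) and Psi_{t-1,t}(yt-1, yt, X)."""
        total = 0.0
        for labels in itertools.product(range(K), repeat=T):
            score = psi_node[0][labels[0]]
            for t in range(1, T):
                score += psi_node[t][labels[t]] + psi_edge[t][labels[t - 1]][labels[t]]
            total += math.exp(score)
        return math.log(total)

    def log_Z_forward(T, K, psi_node, psi_edge):
        """Forward recursion: alpha_t(y) = sum_{y'} alpha_{t-1}(y') * exp(node + edge)."""
        alpha = [math.exp(psi_node[0][y]) for y in range(K)]
        for t in range(1, T):
            alpha = [sum(alpha[yp] * math.exp(psi_node[t][y] + psi_edge[t][yp][y])
                         for yp in range(K)) for y in range(K)]
        return math.log(sum(alpha))

    # Tiny random example: both computations of log Z(X) should match.
    random.seed(0)
    T, K = 4, 3
    psi_node = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(T)]
    psi_edge = [[[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)] for _ in range(T)]
    print(round(log_Z_brute_force(T, K, psi_node, psi_edge), 6),
          round(log_Z_forward(T, K, psi_node, psi_edge), 6))

Brute-force enumeration is exponential in T and is shown only to check the forward recursion, which is what makes training and prediction tractable for linear-chain models.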
24 (2001) studied potential functions that are weighted combinations of binary features: Ψt(yt, X) = ∑_a βa ga(yt, X) and Ψt−1,t(yt−1, yt, X) = ∑_b λb fb(yt−1, yt, X), where the βa's and λb's are trainable weights, and the features ga and fb are boolean functions. [sent-127, score-0.666]
25 In part-of-speech tagging, for example, g234 (yt , X) might be 1 when xt is the word “bank” and yt is the class “noun” (and 0 otherwise). [sent-128, score-0.348]
26 As with sliding window methods, it is natural to define features that depend only on a sliding window wt (X) of X values. [sent-129, score-0.43]
27 For example, in protein secondary structure prediction, Qian and Sejnowski (1988) found that a 13-residue sliding window gave best results for neural network methods. [sent-142, score-0.269]
28 When feature induction is turned on, the learner starts with a single constant feature and (every 8 iterations) introduces new feature conjunctions by taking conjunctions of the basic features with features already in the model. [sent-147, score-0.375]
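The conjunction-induction idea can be sketched schematically as follows; this is not McCallum's actual implementation, and the gain scoring function, model_features, and basic_features names are hypothetical placeholders introduced only for illustration.

    def propose_conjunctions(model_features, basic_features, gain, top_k=10):
        """Form candidate conjunctions of basic features with features already in
        the model, and keep the top_k by some gain criterion.

        A model feature is represented as a frozenset of basic tests; `gain` is a
        hypothetical callable scoring a candidate (in the real algorithm it would
        measure improvement in the training objective)."""
        candidates = []
        for f in model_features:
            for g in basic_features:
                if g not in f:                  # skip trivial self-conjunctions
                    candidates.append(f | {g})
        candidates = [c for c in candidates if c not in model_features]
        candidates.sort(key=gain, reverse=True)
        return candidates[:top_k]

    # One round of induction starting from the constant feature (empty conjunction),
    # with a trivial stand-in gain function:
    print(propose_conjunctions([frozenset()], ["word=bank", "tag=noun"], gain=len))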
29 In this paper, we employ the gradient tree boosting method (Friedman, 2001) to construct complex features from the basic features as part of a stage-wise construction of the potential functions. [sent-150, score-0.382]
30 Gradient tree boosting is based on the idea of functional gradient ascent. [sent-161, score-0.259]
31 We do have a set of training examples sampled from this joint distribution, so we can compute the value of the functional gradient at each of our training data points: ∆m(yi, xi) = ∇Ψ ∑_i log P(yi | xi; Ψ), evaluated at Ψ = Ψm−1. [sent-175, score-0.24]
32 We can then use these point-wise functional gradients to define a set of functional gradient training examples, ((xi, yi), ∆m(yi, xi)), and then train a function hm(y, x) so that it approximates ∆m(yi, xi). [sent-176, score-0.258]
33 All we need to do is to represent and train Ψ(yt , X) and Ψ(yt−1 , yt , X) as weighted sums of regression trees. [sent-207, score-0.301]
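As a rough sketch of this stage-wise procedure (not the paper's implementation), the following uses scikit-learn's DecisionTreeRegressor as the regression-tree learner, assuming scikit-learn is available; the names functional_gradient, examples, n_stages, and max_leaves are our own.

    from sklearn.tree import DecisionTreeRegressor

    def gradient_tree_boost(examples, functional_gradient, n_stages=50, max_leaves=30):
        """Stage-wise fitting of a potential function as a sum of regression trees.

        `examples` is a list of numeric feature vectors; `functional_gradient(trees, x)`
        is assumed to return the point-wise functional gradient for example x given
        the current ensemble (both names are ours, not the paper's)."""
        trees = []
        for m in range(n_stages):
            # Functional-gradient "residuals" under the current ensemble.
            targets = [functional_gradient(trees, x) for x in examples]
            h = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
            h.fit(examples, targets)
            trees.append(h)            # Psi_m = Psi_{m-1} + h_m
        return trees

    def ensemble_value(trees, x):
        """Evaluate the learned potential: the sum of all regression trees at x."""
        return sum(float(t.predict([x])[0]) for t in trees)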
34 Let F^yt(yt−1, X) = Ψ(yt, X) + Ψ(yt−1, yt, X) be a function that computes the “desirability” of label yt given values for label yt−1 and the input features X. [sent-208, score-0.924]
35 With this definition, the CRF has the form P(Y|X) = (1/Z(X)) exp ∑_t F^yt(yt−1, X). [sent-210, score-0.268]
36 We now compute the functional gradient of log P(Y|X) with respect to F^yt(yt−1, X). [sent-211, score-0.37]
37 We will assume that yt takes the value ⊥ for t < 1. [sent-216, score-0.268]
38 The second way to make predictions is to individually predict each yt according to Ht(X) = argmax_v P(yt = v | X), and then concatenate these individual predictions to obtain H(X). [sent-274, score-0.268]
39 This makes sense in applications where the goal is to maximize the number of individual yt ’s correctly predicted, even if the resulting predicted sequence Y is incoherent. [sent-275, score-0.317]
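The two prediction modes can be illustrated with the sketch below (ours), which works over a hypothetical score table F[t][y_prev][y] standing in for F^yt(yt−1, X); the marginals needed for per-position prediction are computed here by brute force purely for clarity (a real implementation would use forward-backward).

    import itertools
    import math

    def viterbi_decode(T, K, F):
        """Most likely label sequence: argmax_Y sum_t F[t][y_{t-1}][y_t].
        F[0][0][y] is used at the first position (the previous label is a dummy)."""
        delta = [F[0][0][y] for y in range(K)]
        back = []
        for t in range(1, T):
            new_delta, ptr = [], []
            for y in range(K):
                scores = [delta[yp] + F[t][yp][y] for yp in range(K)]
                best = max(range(K), key=lambda yp: scores[yp])
                new_delta.append(scores[best])
                ptr.append(best)
            delta, back = new_delta, back + [ptr]
        y = max(range(K), key=lambda v: delta[v])
        path = [y]
        for ptr in reversed(back):
            y = ptr[y]
            path.append(y)
        return list(reversed(path))

    def per_position_decode(T, K, F):
        """Predict each y_t separately as argmax_v P(y_t = v | X), computed here by
        brute-force enumeration of all label sequences (fine only for tiny T, K)."""
        marg = [[0.0] * K for _ in range(T)]
        for labels in itertools.product(range(K), repeat=T):
            score = F[0][0][labels[0]]
            for t in range(1, T):
                score += F[t][labels[t - 1]][labels[t]]
            w = math.exp(score)
            for t in range(T):
                marg[t][labels[t]] += w
        return [max(range(K), key=lambda v: marg[t][v]) for t in range(T)]

On small problems the two decoders can disagree: the Viterbi path maximizes whole-sequence probability, while the per-position decoder maximizes the expected number of correctly labeled positions.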
40 The values of input features can be missing for a wide variety of reasons. [sent-282, score-0.266]
41 (2006) used a feature bagging method to deal with SSL problems where highly indicative features may be missing in the test data. [sent-287, score-0.314]
42 There exist very good methods for handling missing values when growing regression trees, which include the instance weighting method of C4.5. [sent-292, score-0.364]
43 An advantage of training CRFs with gradient tree boosting is that these missing values methods can be used directly in the process of generating regression trees over the functional gradient training examples. [sent-295, score-0.695]
44 Instance Weighting: The instance weighting method (Quinlan, 1993), also known as “proportional distribution”, assigns a weight to each training example, and all splitting decisions are based on weighted statistics. [sent-297, score-0.244]
45 The best feature xj∗ is chosen, and the training examples for which xj∗ is not missing are sent to the appropriate child node. [sent-301, score-0.335]
46 Instance weighting assumes that the training and test examples missing xj∗ will on average behave exactly like the training examples for which xj∗ is not missing. [sent-312, score-0.481]
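A minimal sketch of the proportional-distribution rule (ours, not the paper's code): an example whose split feature is missing is sent to both children, with its weight scaled by each child's share of the weighted non-missing examples.

    def distribute(examples, split_feature, threshold):
        """Split weighted examples on one feature, sending missing-valued examples
        to both children with proportionally reduced weights.

        Each example is a (features_dict, weight) pair; this is a sketch of the
        C4.5-style rule, not the paper's exact implementation."""
        left, right, missing = [], [], []
        for features, w in examples:
            v = features.get(split_feature)
            if v is None:
                missing.append((features, w))
            elif v <= threshold:
                left.append((features, w))
            else:
                right.append((features, w))
        n_left = sum(w for _, w in left)
        n_right = sum(w for _, w in right)
        total = n_left + n_right
        for features, w in missing:
            left.append((features, w * n_left / total))
            right.append((features, w * n_right / total))
        return left, right

    # Example: the third instance is missing 'x1', so it goes both ways.
    ex = [({"x1": 0.2}, 1.0), ({"x1": 0.9}, 1.0), ({}, 1.0)]
    l, r = distribute(ex, "x1", threshold=0.5)
    print([w for _, w in l], [w for _, w in r])   # [1.0, 0.5] [1.0, 0.5]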
47 Hence, each xt consists of a line in the FAQ file, and the corresponding yt ∈ {header, question, answer, tail}. [sent-409, score-0.317]
48 In TreeCRF, the cost of feature induction is the cost of growing a regression tree, which depends on the number of basic features n and the number of internal nodes in the tree L. [sent-552, score-0.285]
49 For each sequence length, we generated a training data set with 100 sequences and employed a sliding window of size 3. [sent-568, score-0.259]
50 We tried sliding window sizes of 3, 5, 7, 9 and 11, so that the number of input features at each sequence position takes the values of 75, 125, 175, 225 and 275 (because each input observation is represented by 25 boolean indicator variables). [sent-574, score-0.36]
51 In addition to the instance weighting and surrogate splitting methods described above, we also studied two simpler methods: imputation and indicator features. [sent-584, score-0.555]
52 Indicator Features: a boolean feature x̃tj is introduced for each feature xtj such that if xtj is present, x̃tj is false. [sent-596, score-0.33]
53 Indicator features make sense when the fact that a value is missing is itself informative. [sent-598, score-0.266]
54 For example, if xt j represents a temperature reading, it may be that extremely cold temperature values tend to be missing because of sensor failure. [sent-599, score-0.255]
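A minimal sketch of the indicator-feature transformation (ours); filling the original value with 0.0 when it is missing is an assumption of this sketch, not something the paper specifies.

    def add_missing_indicators(features):
        """Augment a feature dict with one boolean indicator per feature that is
        true exactly when the original value is missing (a sketch of the
        'indicator features' method; naming conventions are ours)."""
        augmented = {}
        for name, value in features.items():
            missing = value is None
            augmented[name] = 0.0 if missing else value
            augmented[name + "_missing"] = 1.0 if missing else 0.0
        return augmented

    print(add_missing_indicators({"temperature": None, "pressure": 1.3}))
    # {'temperature': 0.0, 'temperature_missing': 1.0, 'pressure': 1.3, 'pressure_missing': 0.0}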
55 For each learning problem, we took the chosen training and test sets and injected missing values at rates of 5%, 10%, 20% and 40%. [sent-602, score-0.252]
56 For a given missing rate, we generate five versions of the training set and five versions of the test set. [sent-603, score-0.252]
57 A CRF is then trained on each of the training sets and evaluated on each of the test sets (for a total of 5 CRFs and 25 evaluations per missing rate). [sent-604, score-0.252]
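The corruption protocol can be sketched as follows (our own illustration, assuming missing-completely-at-random corruption at the level of individual feature values; as noted later, some data sets were corrupted at the residue or letter level instead).

    import random

    def inject_missing(data, rate, seed):
        """Return a copy of `data` (a list of feature dicts) in which each feature
        value is independently replaced by None with probability `rate`."""
        rng = random.Random(seed)
        corrupted = []
        for features in data:
            corrupted.append({k: (None if rng.random() < rate else v)
                              for k, v in features.items()})
        return corrupted

    # Five corrupted versions of a (toy) training set at a 20% missing rate.
    train = [{"x1": 0.3, "x2": 1.0}, {"x1": 0.7, "x2": 0.0}]
    versions = [inject_missing(train, rate=0.20, seed=s) for s in range(5)]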
58 (i.e., we compute ŷt = argmax_yt P(yt | X) for each t separately). [sent-607, score-0.268]
59 [Figure 4 plot residue: legend entries “instance weighting”, “surrogate splitting”, “imputation”, “indicator feature”; y-axis “prediction accuracy (%)”.]
60 [Figure 4 plot residue: x-axis “missing rate (%)” with ticks at 5, 10, 20, 40; panels include (b) NETtalk, (c) Hyphen, (d) FAQ ai-general.]
61 Figure 4: Performance of missing values methods for different missing rates. [sent-621]
62 where m1 , m2 , and m3 are boolean indicator variables that specify which missing values method we are using and the S ’s are indicator variables that specify which of the five training sets we are using. [sent-622, score-0.45]
63 If m1 = 1, this indicates surrogate splitting, m2 = 1 indicates imputation, and m3 = 1 indicates the indicator feature method. [sent-624, score-0.24]
64 We can then test the hypothesis δi ≠ 0 against the null hypothesis δi = 0 to determine whether missing values method i is different from the baseline method. [sent-626, score-0.28]
65 Figure 4a shows that instance weighting achieves the best prediction accuracy for each of the different missing rates. [sent-629, score-0.387]
66 Table 7a shows that the baseline missing values method, instance weighting, is statistically better than the other three missing values methods in most cases. [sent-630, score-0.482]
67 In the FAQ ai-general problem, imputation was the baseline method, so the coefficient values give the log odds of the change in accuracy relative to imputation. [sent-681, score-0.272]
68 In Figure 4b, we see that instance weighting does better than the other three missing values methods for all the different missing rates. [sent-685, score-0.537]
69 The statistical tests reported in Table 7b show that the baseline method, instance weighting, is statistically better than each of the other missing value methods in all cases. [sent-686, score-0.35]
70 Figure 4c shows that instance weighting is the best missing values method except for a missing rate of 5%. [sent-688, score-0.537]
71 Statistical tests shown in Table 7c tell us that for a missing rate of 5%, surrogate splitting is the best missing values method and the other three methods are not statistically significantly different from each other. [sent-689, score-0.645]
72 For a missing rate of 10%, instance weighting and imputation are statistically better than the other two methods (and indistinguishable from each other). [sent-690, score-0.544]
73 For missing rates of 20% and 40%, instance weighting is statistically better than the other three methods. [sent-691, score-0.379]
74 Unlike in the previous data sets, instance weighting is no longer the best missing values method, as shown in Figure 4d. [sent-696, score-0.331]
75 Instead, imputation performs very well for various missing value rates. [sent-697, score-0.371]
76 Table 7d shows that imputation is statistically the best missing values method. [sent-698, score-0.419]
77 For missing rates of 10% and 40%, it is statistically better than the other three methods. [sent-699, score-0.254]
78 For a missing rate of 5%, it does as well as instance weighting and surrogate splitting. [sent-700, score-0.443]
79 For a missing rate of 20%, it does as well as surrogate splitting and indicator features. [sent-701, score-0.471]
80 Hence, in the protein data set, missing values are introduced by choosing an amino acid residue position in the observation sequence and setting all of its boolean indicator features to missing. [sent-706, score-0.349]
81 Similarly, in the NETtalk and hyphenation problems, a letter is made to be missing by setting all 26 indicator features for that letter to missing. [sent-713, score-0.448]
82 Similarly, imputation is computed at the amino acid or letter level, not at the level of boolean features. [sent-714, score-0.296]
83 However, in the Usenet FAQ data set, since the binary features are not exclusive, imputation is computed at the level of boolean features. [sent-715, score-0.263]
84 In the case of protein sequences, imputation will replace missing values with the most frequently-occurring amino acid, which is alanine, code ‘A’. [sent-716, score-0.483]
85 Alanine tends to form alpha helices, so this may cause the learning algorithms to over-predict the helix class, which may explain why imputation performed worst on the protein data set. [sent-717, score-0.244]
86 In the case of English words, the most common letter is ‘E’, and it does not carry much information either about pronunciation or about hyphenation, so this may explain why imputation worked well in the NETtalk and hyphenation problems. [sent-718, score-0.238]
87 Hence, in most cases, imputation with the most common feature value will supply the correct missing value. [sent-720, score-0.419]
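A minimal sketch of this most-frequent-value imputation (ours); column names and the use of None for missing entries are assumptions of the sketch.

    from collections import Counter

    def impute_most_frequent(columns):
        """Replace missing entries (None) in each column by that column's most
        frequent observed value -- a sketch of the imputation baseline (for protein
        sequences this ends up being the most common amino acid, e.g. 'A')."""
        imputed = {}
        for name, values in columns.items():
            observed = [v for v in values if v is not None]
            mode = Counter(observed).most_common(1)[0][0] if observed else None
            imputed[name] = [mode if v is None else v for v in values]
        return imputed

    print(impute_most_frequent({"residue": ["A", None, "G", "A", None]}))
    # {'residue': ['A', 'A', 'G', 'A', 'A']}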
88 (e.g., in medicine, a feature could be missing because a physician explicitly chose not to measure it). [sent-724, score-0.254]
89 Because features were marked as missing completely at random, this is not true, so the indicator feature carries no positive information about the class label. [sent-725, score-0.394]
90 However, in cases where imputation causes problems, the indicator feature approach may help prevent those problems by being more neutral. [sent-726, score-0.293]
91 This may explain why the indicator feature method works slightly better in most cases than the imputation method. [sent-728, score-0.293]
92 2135 D IETTERICH , H AO AND A SHENFELTER The surrogate splitting method assumes that the input features are correlated with one another, so that if one feature is missing, its value can be computed from another feature. [sent-729, score-0.293]
93 In the experiment, each of these 20 features could be independently marked as missing, which is a bit unrealistic, since presumably the real missing values would involve some loss or corruption of the words making up the line, and this would affect multiple features. [sent-737, score-0.266]
94 The 20 features do have some redundancy, so we would expect that surrogate splitting should work well, and it does for 5% and 20% missing rates. [sent-738, score-0.451]
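A rough sketch of CART-style surrogate selection (ours, not the paper's implementation): among candidate features, pick the split that best agrees with the primary split on examples where both features are observed.

    def best_surrogate(examples, primary, candidates):
        """Pick the candidate split that best reproduces the primary split's
        left/right assignment. `primary` and each candidate are (feature_name,
        threshold) pairs; examples are feature dicts. Ties and CART's
        majority-direction fallback are ignored here for brevity."""
        p_name, p_thr = primary
        best, best_agree = None, -1.0
        for c_name, c_thr in candidates:
            pairs = [(f[p_name] <= p_thr, f[c_name] <= c_thr)
                     for f in examples
                     if f.get(p_name) is not None and f.get(c_name) is not None]
            if not pairs:
                continue
            agree = sum(a == b for a, b in pairs) / len(pairs)
            # A surrogate may also be used 'reversed'; take the better orientation.
            agree = max(agree, 1.0 - agree)
            if agree > best_agree:
                best, best_agree = (c_name, c_thr), agree
        return best, best_agree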
95 The instance weighting method assumes that the feature values are missing at random and that the other features provide no redundant information, so the most sensible thing to do is to marginalize away the uncertainty about the missing values. [sent-739, score-0.645]
96 TreeCRF also provides the additional ability to handle missing data with the instance weighting and surrogate splitting methods, which are not available in MALLET and other CRF training algorithms. [sent-752, score-0.562]
97 The experiments suggest that when the feature values are missing at random, the instance weighting approach works very well. [sent-753, score-0.379]
98 In the one domain where instance weighting did not work well, imputation was the best method. [sent-754, score-0.29]
99 The good performance of the indicator features and imputation methods is encouraging, because these methods can be applied with all known methods for sequential supervised learning, not only with gradient tree boosting. [sent-758, score-0.461]
100 Since there is no one best method for handling missing values, as with many other aspects of machine learning, preliminary experiments on subsets of the training data are required to select the most appropriate method. [sent-759, score-0.252]
wordName wordTfidf (topN-words)
[('ree', 0.468), ('allet', 0.37), ('crf', 0.343), ('yt', 0.268), ('faq', 0.234), ('missing', 0.206), ('imputation', 0.165), ('nettalk', 0.138), ('surrogate', 0.112), ('yd', 0.103), ('weighting', 0.103), ('hyphen', 0.094), ('ietterich', 0.094), ('shenfelter', 0.094), ('crfs', 0.091), ('mallet', 0.087), ('ssl', 0.087), ('wd', 0.086), ('tree', 0.083), ('window', 0.083), ('ields', 0.08), ('onditional', 0.08), ('raining', 0.08), ('indicator', 0.08), ('mccallum', 0.079), ('protein', 0.079), ('boosting', 0.074), ('baseline', 0.074), ('sliding', 0.074), ('cpu', 0.074), ('gradient', 0.073), ('splitting', 0.073), ('radient', 0.072), ('oosting', 0.072), ('ao', 0.066), ('features', 0.06), ('nodequeue', 0.058), ('andom', 0.057), ('wt', 0.056), ('aix', 0.051), ('xt', 0.049), ('statistically', 0.048), ('feature', 0.048), ('training', 0.046), ('hyphenation', 0.044), ('nright', 0.044), ('splitfeature', 0.044), ('cumulative', 0.038), ('boolean', 0.038), ('induction', 0.037), ('conjunctions', 0.037), ('seconds', 0.036), ('nle', 0.036), ('plitattribute', 0.036), ('splitscore', 0.036), ('cart', 0.035), ('sent', 0.035), ('ascent', 0.035), ('hm', 0.035), ('amino', 0.033), ('regression', 0.033), ('accuracy', 0.033), ('secondary', 0.033), ('potential', 0.032), ('trees', 0.032), ('shrinkage', 0.032), ('andrew', 0.032), ('sequences', 0.031), ('word', 0.031), ('acid', 0.031), ('est', 0.031), ('ind', 0.031), ('label', 0.03), ('viterbi', 0.03), ('english', 0.03), ('iterations', 0.03), ('falsedata', 0.029), ('mcnemar', 0.029), ('numleaves', 0.029), ('qian', 0.029), ('quared', 0.029), ('treecrf', 0.029), ('truedata', 0.029), ('veb', 0.029), ('functional', 0.029), ('letter', 0.029), ('tagging', 0.028), ('sejnowski', 0.026), ('stress', 0.026), ('spent', 0.026), ('sequence', 0.025), ('predicted', 0.024), ('internal', 0.024), ('les', 0.024), ('xi', 0.023), ('prediction', 0.023), ('instance', 0.022), ('elds', 0.022), ('enerate', 0.022), ('newsgroup', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
2 0.15268147 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
3 0.11412671 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
4 0.08171311 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
5 0.062918536 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett
Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col
6 0.061896104 45 jmlr-2008-JNCC2: The Java Implementation Of Naive Credal Classifier 2 (Machine Learning Open Source Software Paper)
7 0.058981113 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.056443024 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
9 0.053338379 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
10 0.050300039 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
11 0.050112016 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
12 0.046805419 54 jmlr-2008-Learning to Select Features using their Properties
13 0.045604307 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
14 0.041514955 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
15 0.037210837 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
16 0.035941146 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
17 0.03464257 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
18 0.034170199 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
19 0.03165153 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
topicId topicWeight
[(0, 0.181), (1, -0.018), (2, 0.165), (3, 0.042), (4, -0.258), (5, 0.123), (6, 0.084), (7, 0.055), (8, -0.103), (9, -0.102), (10, -0.092), (11, -0.113), (12, -0.001), (13, -0.047), (14, -0.035), (15, -0.103), (16, -0.253), (17, 0.105), (18, 0.0), (19, 0.024), (20, -0.019), (21, 0.074), (22, -0.016), (23, -0.036), (24, 0.293), (25, 0.046), (26, 0.088), (27, -0.026), (28, 0.172), (29, 0.019), (30, -0.097), (31, -0.075), (32, -0.036), (33, 0.125), (34, -0.08), (35, -0.056), (36, 0.124), (37, -0.067), (38, -0.215), (39, 0.091), (40, 0.049), (41, -0.036), (42, 0.041), (43, -0.005), (44, -0.212), (45, -0.13), (46, -0.057), (47, -0.055), (48, 0.025), (49, -0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.94199848 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
2 0.56569749 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
3 0.52714467 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
4 0.37532678 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
Author: Giorgio Corani, Marco Zaffalon
Abstract: JNCC2 implements the naive credal classifier 2 (NCC2). This is an extension of naive Bayes to imprecise probabilities that aims at delivering robust classifications also when dealing with small or incomplete data sets. Robustness is achieved by delivering set-valued classifications (that is, returning multiple classes) on the instances for which (i) the learning set is not informative enough to smooth the effect of choice of the prior density or (ii) the uncertainty arising from missing data prevents the reliable indication of a single class. JNCC2 is released under the GNU GPL license. Keywords: imprecise probabilities, missing data, naive Bayes, naive credal classifier 2, Java
6 0.23081382 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
7 0.22083692 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.21849447 54 jmlr-2008-Learning to Select Features using their Properties
9 0.21336268 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
10 0.19722679 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
11 0.18758793 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
12 0.17562908 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
13 0.17469595 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
14 0.17034744 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
15 0.1693269 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
16 0.16768365 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
17 0.13762209 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
18 0.13401249 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
19 0.12821904 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
20 0.12622632 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
topicId topicWeight
[(0, 0.016), (5, 0.026), (10, 0.011), (24, 0.316), (31, 0.019), (40, 0.044), (54, 0.064), (58, 0.034), (66, 0.051), (69, 0.012), (72, 0.022), (76, 0.024), (79, 0.011), (82, 0.038), (88, 0.061), (91, 0.025), (92, 0.033), (94, 0.06), (99, 0.022)]
simIndex simValue paperId paperTitle
1 0.87436575 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
Author: Liviu Panait, Karl Tuyls, Sean Luke
Abstract: This paper presents the dynamics of multiple learning agents from an evolutionary game theoretic perspective. We provide replicator dynamics models for cooperative coevolutionary algorithms and for traditional multiagent Q-learning, and we extend these differential equations to account for lenient learners: agents that forgive possible mismatched teammate actions that resulted in low rewards. We use these extended formal models to study the convergence guarantees for these algorithms, and also to visualize the basins of attraction to optimal and suboptimal solutions in two benchmark coordination problems. The paper demonstrates that lenience provides learners with more accurate information about the benefits of performing their actions, resulting in higher likelihood of convergence to the globally optimal solution. In addition, the analysis indicates that the choice of learning algorithm has an insignificant impact on the overall performance of multiagent learning algorithms; rather, the performance of these algorithms depends primarily on the level of lenience that the agents exhibit to one another. Finally, the research herein supports the strength and generality of evolutionary game theory as a backbone for multiagent learning. Keywords: multiagent learning, reinforcement learning, cooperative coevolution, evolutionary game theory, formal models, visualization, basins of attraction
same-paper 2 0.73864275 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
3 0.3895838 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
Author: Faustino Gomez, Jürgen Schmidhuber, Risto Miikkulainen
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks. Keywords: coevolution, recurrent neural networks, non-linear control, genetic algorithms, experimental comparison
4 0.38894618 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
Author: Salvador García, Francisco Herrera
Abstract: In a recently published paper in JMLR, Demˇar (2006) recommends a set of non-parametric stas tistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing n × n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures. Keywords: statistical methods, non-parametric test, multiple comparisons tests, adjusted p-values, logically related hypotheses
6 0.37408808 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
7 0.36827573 83 jmlr-2008-Robust Submodular Observation Selection
8 0.3682245 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
9 0.367134 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
10 0.36485925 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
11 0.36216903 86 jmlr-2008-SimpleMKL
12 0.36210412 58 jmlr-2008-Max-margin Classification of Data with Absent Features
13 0.36116943 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
14 0.35804665 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
15 0.35794866 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
16 0.35589358 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
17 0.35459393 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
18 0.35381538 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
19 0.35360411 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
20 0.35182726 57 jmlr-2008-Manifold Learning: The Price of Normalization