nips nips2006 nips2006-195 knowledge-graph by maker-knowledge-mining

195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy


Source: pdf

Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou

Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. [sent-11, score-0.786]

2 We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. [sent-12, score-0.255]

3 In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. [sent-13, score-0.213]

4 CRFs define the conditional distribution Pw (y | x) as a function of features relating labels to the input sequence. [sent-22, score-0.158]

5 Ideally, training a CRF involves finding a parameter set w that gives high accuracy when labeling new sequences. [sent-23, score-0.435]

6 In some cases, however, simply finding parameters that give the best possible accuracy on training data (known as empirical risk minimization [2]) can be difficult. [sent-24, score-0.516]

7 Consequently, surrogate optimization problems, such as maximum likelihood or maximum margin training, are solved instead. [sent-26, score-0.519]

8 In this paper, we describe a training procedure that addresses the problem of minimizing empirical per-label risk for CRFs. [sent-27, score-0.301]

9 Specifically, our technique attempts to minimize a smoothed approximation of the Hamming loss incurred by the maximum expected accuracy decoding (i.e., posterior decoding) algorithm. [sent-28, score-0.615]

10 The degree of approximation is controlled by a parameterized function Q(·) which trades off between the accuracy of the approximation and the smoothness of the objective. [sent-31, score-0.269]

11 In the limit as Q(·) approaches the step function, the optimization objective converges to the empirical risk minimization criterion for Hamming loss. [sent-32, score-0.349]

12 The gradient of the optimization objective is everywhere zero (except at points where the objective is discontinuous), because a sufficiently small change in parameters will not change the predicted labeling. [sent-33, score-0.434]

13 Furthermore, for a pair of consecutive labels $y_{j-1}$ and $y_j$, an input sequence $\mathbf{x}$, and a label position $j$, let $f(y_{j-1}, y_j, \mathbf{x}, j) \in \mathbb{R}^n$ be a vector-valued function; we call $f$ the feature mapping of the CRF. [sent-36, score-1.112]
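To make the setup concrete, here is a toy sketch (ours, not the authors' code) of such a feature mapping for a two-label CRF; the label set, feature templates, and helper names are all illustrative assumptions:

```python
import numpy as np

LABELS = ["C", "I"]  # hypothetical two-label state space (e.g., coding/intergenic)

def feature_mapping(y_prev, y_curr, x, j):
    """Toy f(y_{j-1}, y_j, x, j) in R^n: label-bigram indicators plus an
    indicator coupling the current label to the observation at position j."""
    K = len(LABELS)
    f = np.zeros(K * K + K * 2)
    f[LABELS.index(y_prev) * K + LABELS.index(y_curr)] = 1.0  # transition feature
    obs = 1 if x[j] in "GC" else 0                            # toy observation feature
    f[K * K + LABELS.index(y_curr) * 2 + obs] = 1.0
    return f

w = np.random.randn(8)
# The CRF assigns potential exp(w . f(...)) to each consecutive label pair.
print(np.exp(w @ feature_mapping("C", "I", "ACGT", 2)))
```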

14 Maximum expected accuracy parsing. Given a CRF with parameters $w$, the sequence labeling task is to determine values for the labels $\mathbf{y}$ of a new input sequence $\mathbf{x}$. [sent-40, score-0.685]

15 An alternative approach, which seeks to maximize the per-label accuracy of the prediction rather than the joint probability of the entire parse, chooses the most likely (i.e., maximum posterior probability) value for each label individually. [sent-43, score-0.215]

16 Note that
$$\arg\max_{\mathbf{y}} \prod_{j=1}^{L} P_w(y_j \mid \mathbf{x}) \;=\; \arg\max_{\mathbf{y}} \, \mathbb{E}_{\mathbf{y}'}\Big[\sum_{j=1}^{L} \mathbf{1}\{y_j = y'_j\}\Big] \qquad (2)$$
where $\mathbf{1}\{\text{condition}\}$ denotes the usual indicator function whose value is 1 when condition is true and 0 otherwise, and where the expectation is taken with respect to the conditional distribution $P_w(\mathbf{y}' \mid \mathbf{x})$. [sent-46, score-0.696]
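The right-hand side of (2) reduces to a sum of per-position marginals by linearity of expectation, which is why both sides are maximized by choosing each label independently:

```latex
\mathbb{E}_{\mathbf{y}'}\Big[\sum_{j=1}^{L} \mathbf{1}\{y_j = y'_j\}\Big]
  = \sum_{j=1}^{L} \Pr\{y'_j = y_j \mid \mathbf{x}\}
  = \sum_{j=1}^{L} P_w(y_j \mid \mathbf{x}),
```

and each summand depends only on the choice of $y_j$, so the maximizer sets every $y_j$ to its individually most probable value.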

17 From this, we see that maximum expected accuracy parsing chooses the parse with the maximum expected number of correct labels. [sent-47, score-0.77]

18 In practice, maximum expected accuracy parsing often yields more accurate results than Viterbi parsing (on a per-label basis) [3, 4, 5]. [sent-48, score-0.564]

19 Here, we restrict our focus to maximum expected accuracy parsing procedures and seek training criteria which optimize the performance of a CRF-based maximum expected accuracy parser. [sent-49, score-1.036]
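A minimal sketch of maximum expected accuracy decoding, assuming the per-position posteriors $P_w(y_j \mid \mathbf{x})$ have already been computed (e.g., by forward-backward); function and variable names are ours:

```python
import numpy as np

def mea_decode(posteriors):
    """posteriors[j, k] = Pw(y_j = k | x), shape (L, K).
    The parse maximizing the expected number of correct labels simply takes
    the most probable label independently at each position."""
    return np.argmax(posteriors, axis=1)

posteriors = np.array([[0.7, 0.3], [0.4, 0.6], [0.55, 0.45]])
print(mea_decode(posteriors))  # -> [0 1 0]
```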

20 Training conditional random fields. Usually, CRFs are trained in the batch setting, where a complete set $D = \{(\mathbf{x}^{(t)}, \mathbf{y}^{(t)})\}_{t=1}^{m}$ of training examples is available up front. [sent-50, score-0.281]

21 In this case, training amounts to numerical optimization of a fixed objective function R(w : D). [sent-51, score-0.356]

22 A good objective function is one whose optimal value leads to parameters that perform well, in an application-dependent sense, on previously unseen testing examples. [sent-52, score-0.247]

23 While this can be difficult to achieve without knowing the contents of the testing set, one can, under certain conditions, guarantee that the accuracy of a learned CRF on an unseen testing set is probably not much worse than its accuracy on the training set. [sent-53, score-0.777]

24 For independent and identically distributed (i.i.d.) training and testing examples, there exists a probabilistic bound on the difference between empirical risk and generalization error [2]. [sent-57, score-0.372]

25 As long as enough training data are available (relative to model complexity), strong training set performance will imply, with high probability, similarly strong testing set performance. [sent-58, score-0.379]

26 Loss functions based on usual notions of per-label accuracy (such as Hamming loss) are typically not only nonconvex but also not amenable to optimization by methods that make use of gradient information. [sent-60, score-0.368]

27 In this section, we briefly describe three previous approaches for CRF training which optimize surrogate loss functions in lieu of the empirical risk. [sent-62, score-0.408]

28 Then, we consider a new method for gradient-based CRF training oriented more directly toward optimizing predictive performance on the training set. [sent-63, score-0.329]

29 Our method minimizes an arbitrarily accurate approximation of empirical risk, where the loss function is defined as the number of labels predicted incorrectly by maximum expected accuracy parsing. [sent-64, score-0.654]
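Concretely, the loss being approximated is just a per-position mismatch count; a sketch (ours):

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Number of sequence positions labeled incorrectly."""
    return int(np.sum(np.asarray(y_true) != np.asarray(y_pred)))

print(hamming_loss([0, 1, 0, 1], [0, 1, 1, 1]))  # -> 1
```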

30 Previous objective functions: conditional log-likelihood. Conditional log-likelihood is the most commonly used objective function for training conditional random fields. [sent-68, score-0.579]

31 Because this objective is convex, gradient-based optimization methods (e.g., conjugate gradient or L-BFGS [6]) will not converge to suboptimal local minima of the objective function. [sent-71, score-0.22]

32 However, there is no guarantee that the parameters obtained by conditional log-likelihood training will lead to the best per-label predictive accuracy, even on the training set. [sent-72, score-0.424]

33 For one, maximum likelihood training explicitly considers only the probability of exact training parses. [sent-73, score-0.497]

34 Pointwise conditional log-likelihood. Kakade et al. [sent-81, score-0.221]

35 investigated an alternative nonconvex training objective for CRFs [7, 8] which considers separately the posterior label probabilities at each position of each training sequence. [sent-82, score-0.674]

36 Nevertheless, pointwise logloss is fundamentally quite different from Hamming loss. [sent-84, score-0.253]

37 A training procedure based on pointwise log likelihood, for example, would prefer to trade a small reduction in the posterior probability of a correct label for a comparable gain in the posterior probability of a hopelessly incorrect label. [sent-85, score-0.516] [sent-87, score-0.179]
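A hypothetical numeric illustration of this asymmetry (the probabilities below are ours, chosen only to make the point visible): lowering a correct label's posterior from 0.5 to 0.4 costs only about 0.22 nats of pointwise logloss, while raising a hopelessly misclassified label's correct-class posterior from 0.01 to 0.10 saves about 2.30 nats, so the pointwise objective gladly makes a trade that can turn a correct prediction into an incorrect one:

```latex
\Delta_{\text{cost}} = -\log(0.4) - (-\log(0.5)) = \log\tfrac{0.5}{0.4} \approx 0.223,
\qquad
\Delta_{\text{gain}} = -\log(0.01) - (-\log(0.10)) = \log\tfrac{0.10}{0.01} \approx 2.303.
```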

39 Thus, the objective retains the difficulties of the regular conditional log likelihood when dealing with difficult-to-classify outlier labels. [sent-90, score-0.351]

40 Maximum margin training. The notion of Hamming distance is incorporated directly in the maximum margin training procedures of Taskar et al. [sent-93, score-0.662]

41 [9]:
$$R_{\text{max margin}}(w : D) = C\|w\|^2 + \sum_{t=1}^{m} \max\Big(0,\; \max_{\mathbf{y} \in \mathcal{Y}^L} \big[\Delta(\mathbf{y}, \mathbf{y}^{(t)}) - w^T \delta F_{1,L}(\mathbf{x}^{(t)}, \mathbf{y})\big]\Big), \qquad (5)$$
and Tsochantaridis et al. [sent-94, score-0.192]

42
$$R_{\text{max margin}}(w : D) = C\|w\|^2 + \sum_{t=1}^{m} \max\Big(0,\; \max_{\mathbf{y} \in \mathcal{Y}^L} \Delta(\mathbf{y}, \mathbf{y}^{(t)})\,\big(1 - w^T \delta F_{1,L}(\mathbf{x}^{(t)}, \mathbf{y})\big)\Big). \qquad (6)$$ [sent-96, score-0.168]

43 In the former formulation, loss is incurred when the Hamming distance between the correct parse y(t) and a candidate parse y exceeds the obtained classification margin between y(t) and y. [sent-98, score-0.541]

44 In the latter formulation, the amount of loss for a margin violation scales linearly with the Hamming distance between $\mathbf{y}^{(t)}$ and $\mathbf{y}$. [sent-99, score-0.175]

45 Both cases lead to convex optimization problems in which the loss incurred for a particular training example is an upper bound on the Hamming loss between the correct parse and its highest scoring alternative. [sent-100, score-0.574]

46 In practice, however, this upper bound can be quite loose; thus, parameters obtained via a maximum margin framework may be poor minimizers of empirical risk. [sent-101, score-0.269]

47 Training for maximum labelwise accuracy. In each of the likelihood-based or margin-based objective functions introduced in the previous subsections, difficulties arose due to the mismatch between the chosen objective function and our notion of empirical risk as defined by Hamming loss. [sent-103, score-1.121]

48 In this section, we demonstrate how to construct a smooth objective function for maximum expected accuracy parsing which more closely approximates our desired notion of empirical risk. [sent-104, score-0.708]

49 The labelwise accuracy objective function. Consider the following objective function:
$$R(w : D) = \sum_{t=1}^{m} \sum_{j=1}^{L} \mathbf{1}\Big\{ y_j^{(t)} = \arg\max_{y_j} P_w(y_j \mid \mathbf{x}^{(t)}) \Big\}. \qquad (7)$$ [sent-107, score-1.349]

50 Maximizing this objective is equivalent to minimizing empirical risk under the Hamming loss (i.e., the number of incorrectly predicted labels). [sent-108, score-0.837]

51 To obtain a smooth approximation to this objective function, we can express the condition that the algorithm predicts the correct label for $y_j^{(t)}$ in terms of the posterior probabilities of correct and incorrect labels as
$$P_w(y_j^{(t)} \mid \mathbf{x}^{(t)}) - \max_{y_j \neq y_j^{(t)}} P_w(y_j \mid \mathbf{x}^{(t)}) > 0. \qquad (8)$$ [sent-111, score-1.039]

52 Substituting equation (8) back into equation (7) and replacing the indicator function with a generic function $Q(\cdot)$, we obtain
$$R_{\text{labelwise}}(w : D) = \sum_{t=1}^{m} \sum_{j=1}^{L} Q\Big( P_w(y_j^{(t)} \mid \mathbf{x}^{(t)}) - \max_{y_j \neq y_j^{(t)}} P_w(y_j \mid \mathbf{x}^{(t)}) \Big). \qquad (9)$$ [sent-112, score-0.558]

53 When $Q(\cdot)$ is chosen to be the indicator function, $Q(x) = \mathbf{1}\{x > 0\}$, we recover the original objective. [sent-113, score-0.481]

54 By choosing a nicely behaved form for Q(·), however, we obtain a new objective that is easier to optimize. [sent-114, score-0.154]

55 One natural choice is the sigmoid,
$$Q(x; \lambda) = \frac{1}{1 + \exp(-\lambda x)}. \qquad (10)$$
As $\lambda \to \infty$, $Q(x; \lambda) \to \mathbf{1}\{x > 0\}$, so $R_{\text{labelwise}}(w : D)$ approaches the objective function defined in (7). [sent-116, score-0.154]

56 Because this choice of $Q(\cdot)$ is smooth and differentiable, we are free to use gradient-based optimization to maximize our new objective function. [sent-118, score-0.202]

57 As $\lambda$ gets larger, the quality of our approximation of the ideal Hamming loss objective improves; however, the approximation itself also becomes less smooth and perhaps more difficult to optimize as a result. [sent-119, score-0.344]

58 Thus, the value of λ controls the trade-off between the accuracy of the approximation and the ease of optimization. [sent-120, score-0.242]
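A minimal sketch (ours, assuming per-position posterior marginals are already available) of the smoothed objective in equations (9)-(10):

```python
import numpy as np

def sigmoid_q(x, lam):
    """Q(x; lambda) = 1 / (1 + exp(-lambda x)); tends to 1{x > 0} as lambda grows."""
    return 1.0 / (1.0 + np.exp(-lam * x))

def labelwise_objective(posteriors, y_true, lam=15.0):
    """posteriors: (L, K) marginals Pw(y_j = k | x); y_true: true label indices.
    Sums Q(Pw(correct) - best incorrect posterior) over positions of one sequence."""
    total = 0.0
    for j in range(posteriors.shape[0]):
        p_correct = posteriors[j, y_true[j]]
        p_best_wrong = np.max(np.delete(posteriors[j], y_true[j]))
        total += sigmoid_q(p_correct - p_best_wrong, lam)
    return total

posteriors = np.array([[0.7, 0.3], [0.45, 0.55]])
print(labelwise_objective(posteriors, [0, 0]))  # second position contributes ~0.18
```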

59 The labelwise accuracy objective gradient. We now present an algorithm for efficiently calculating the gradient of the approximate accuracy objective. [sent-123, score-0.983]

60 For a fixed parameter set $w$, let $\tilde{y}_j^{(t)}$ denote the label other than $y_j^{(t)}$ that has the maximum posterior probability at position $j$. [sent-124, score-1.166]

61 As with log-barrier optimization, performing the maximization of Rlabelwise (w : D) using a small value of λ, and gradually increasing λ while using the previous solution as a starting point for the new optimization, provides a viable technique for maximizing the labelwise accuracy objective. [sent-126, score-0.526]
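A sketch of that continuation strategy; `optimize` stands in for any gradient-based maximizer (L-BFGS, conjugate gradient), and all names and schedule constants here are our own placeholders:

```python
def anneal_lambda(w0, make_objective, optimize, lam0=1.0, factor=2.0, rounds=6):
    """Maximize R_labelwise with a gradually sharpening sigmoid Q, warm-starting
    each round from the previous solution (cf. log-barrier continuation methods)."""
    w, lam = w0, lam0
    for _ in range(rounds):
        w = optimize(make_objective(lam), w)  # warm start from previous optimum
        lam *= factor                         # sharpen Q toward the step function
    return w
```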

62 Differentiating (9) with the max fixed at $\tilde{y}_j^{(t)}$ gives
$$\nabla R_{\text{labelwise}}(w : D) = \sum_{t=1}^{m} \sum_{j=1}^{L} Q'_j(w)\, \nabla_w \Big[ P_w(y_j^{(t)} \mid \mathbf{x}^{(t)}) - P_w(\tilde{y}_j^{(t)} \mid \mathbf{x}^{(t)}) \Big]. \qquad (11)$$
Using equation (1), the inner term, $P_w(y_j^{(t)} \mid \mathbf{x}^{(t)}) - P_w(\tilde{y}_j^{(t)} \mid \mathbf{x}^{(t)})$, is equal to
$$\frac{1}{Z(\mathbf{x}^{(t)})} \sum_{\mathbf{y}} \Big( \mathbf{1}\{y_j = y_j^{(t)}\} - \mathbf{1}\{y_j = \tilde{y}_j^{(t)}\} \Big) \cdot \exp\big( w^T F_{1,L}(\mathbf{x}^{(t)}, \mathbf{y}) \big). \qquad (12)$$ [sent-132, score-0.974]

63 Most of the terms involved in the gradient are easy to compute using the standard forward and backward matrices used for regular CRF inference, which we define here as
$$\alpha(i, j) = \sum_{\mathbf{y}_{1:j}} \mathbf{1}\{y_j = i\} \cdot \exp\big( w^T F_{1,j}(\mathbf{x}^{(t)}, \mathbf{y}) \big) \qquad (13)$$
$$\beta(i, j) = \sum_{\mathbf{y}_{j:L}} \mathbf{1}\{y_j = i\} \cdot \exp\big( w^T F_{j+1,L}(\mathbf{x}^{(t)}, \mathbf{y}) \big). \qquad (14)$$ [sent-134, score-0.193]

64 The two difficult terms that do not follow from the forward and backward matrices have the form
$$\sum_{k=1}^{L} \sum_{\mathbf{y}_{1:L}} Q'_k(w) \cdot \mathbf{1}\{y_k = y_k^{\star}\} \cdot F_{1,L}(\mathbf{x}^{(t)}, \mathbf{y}) \cdot \exp\big( w^T F_{1,L}(\mathbf{x}^{(t)}, \mathbf{y}) \big), \qquad (15)$$
where $Q'_j(w) = Q'\big( P_w(y_j^{(t)} \mid \mathbf{x}^{(t)}) - P_w(\tilde{y}_j^{(t)} \mid \mathbf{x}^{(t)}) \big)$ and $\mathbf{y}^{\star}$ is either $\mathbf{y}^{(t)}$ or $\tilde{\mathbf{y}}^{(t)}$. [sent-135, score-0.178]
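Before the harder starred quantities, here is a sketch (ours) of computing the standard matrices of equations (13)-(14) by dynamic programming over pairwise potentials; the boundary handling at position 0 is a simplification of this sketch:

```python
import numpy as np

def forward_backward(potentials):
    """potentials[j, i_prev, i] = exp(w . f(i_prev, i, x, j)), shape (L, K, K).
    Returns unnormalized alpha(i, j) and beta(i, j) as in equations (13)-(14)."""
    L, K, _ = potentials.shape
    alpha, beta = np.zeros((K, L)), np.zeros((K, L))
    alpha[:, 0] = 1.0  # simplified boundary; real code folds in position-0 features
    beta[:, L - 1] = 1.0
    for j in range(1, L):
        alpha[:, j] = alpha[:, j - 1] @ potentials[j]    # sum over previous label
    for j in range(L - 2, -1, -1):
        beta[:, j] = potentials[j + 1] @ beta[:, j + 1]  # sum over next label
    return alpha, beta
```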

65 To efficiently compute terms of this type, we define
$$\alpha^{\star}(i, j) = \sum_{k=1}^{j} \sum_{\mathbf{y}_{1:j}} \mathbf{1}\{y_k = y_k^{\star} \wedge y_j = i\} \cdot Q'_k(w) \cdot \exp\big( w^T F_{1,j}(\mathbf{x}^{(t)}, \mathbf{y}) \big) \qquad (16)$$
$$\beta^{\star}(i, j) = \sum_{k=j+1}^{L} \sum_{\mathbf{y}_{j:L}} \mathbf{1}\{y_k = y_k^{\star} \wedge y_j = i\} \cdot Q'_k(w) \cdot \exp\big( w^T F_{j+1,L}(\mathbf{x}^{(t)}, \mathbf{y}) \big). \qquad (17)$$ [sent-136, score-1.144]

66 The remaining entries are given by the following recurrences:
$$\alpha^{\star}(i, j) = \sum_{i'} \Big[ \alpha^{\star}(i', j-1) + \mathbf{1}\{i = y_j^{\star}\} \cdot \alpha(i', j-1) \cdot Q'_j(w) \Big] \cdot e^{w^T f(i', i, \mathbf{x}^{(t)}, j)} \qquad (18)$$
$$\beta^{\star}(i, j) = \sum_{i'} \Big[ \beta^{\star}(i', j+1) + \mathbf{1}\{i' = y_{j+1}^{\star}\} \cdot \beta(i', j+1) \cdot Q'_{j+1}(w) \Big] \cdot e^{w^T f(i, i', \mathbf{x}^{(t)}, j+1)}. \qquad (19)$$ [sent-139, score-0.517]

67 It follows that equation (15) is equal to
$$\sum_{j=1}^{L} \sum_{i'} \sum_{i} f(i', i, \mathbf{x}^{(t)}, j) \cdot \exp\big( w^T f(i', i, \mathbf{x}^{(t)}, j) \big) \cdot (A + B), \qquad (20)$$
where $A = \alpha^{\star}(i', j-1) \cdot \beta(i, j) + \alpha(i', j-1) \cdot \beta^{\star}(i, j)$ and $B = \mathbf{1}\{i = y_j^{\star}\} \cdot \alpha(i', j-1) \cdot \beta(i, j) \cdot Q'_j(w)$. [sent-140, score-0.519]
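A direct transcription (ours) of recurrence (18), given `alpha` from the previous sketch, per-position values $Q'_j(w)$, and the target parse $\mathbf{y}^{\star}$; the boundary condition is again a sketch-level simplification:

```python
import numpy as np

def alpha_star(potentials, alpha, q_prime, y_star):
    """alpha_star(i, j) from recurrence (18): Q'-weighted mass of paths that agree
    with y_star somewhere up to position j and end with label i at position j."""
    L, K, _ = potentials.shape
    a_star = np.zeros((K, L))
    a_star[y_star[0], 0] = q_prime[0]  # simplified boundary condition
    for j in range(1, L):
        for i in range(K):
            a_star[i, j] = sum(
                (a_star[ip, j - 1]
                 + (i == y_star[j]) * alpha[ip, j - 1] * q_prime[j])
                * potentials[j, ip, i]
                for ip in range(K))
    return a_star
```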

68 One could replace the max with a softmax function, and assuming unique probabilities for each candidate label, the gradient approaches (11) as the softmax function approaches the max. [sent-144, score-0.208]

69 We note that the “trick” used in the formulation of approximate accuracy is applicable to a variety of other forms and arguments for $Q(\cdot)$. [sent-146, score-0.239]

70 In particular, if we change its argument to $P_w(y_j^{(t)} \mid \mathbf{x}^{(t)})$, letting $Q(x) = \log(x)$ gives the pointwise logloss formulation of Kakade et al. [sent-147, score-0.301]

71 (Section 3.2), while letting $Q(x) = x$ gives an objective function equal to expected accuracy. [sent-150, score-0.189]

72 Panel (b) of Figure 1 shows the proportion of state labels correctly predicted by the learned models at varying levels of label noise. [sent-171, score-0.191]

73 Simulation experiments. To test the performance of the approximate labelwise accuracy objective function, we first ran simulation experiments in order to assess the robustness of several different learning algorithms in problems with a high degree of label noise. [sent-174, score-0.798]

74 Given a fixed noise parameter p ∈ [0, 1], we generated training sequence labels by flipping each run of consecutive ‘C’ hidden state labels to ‘I’ with probability p. [sent-176, score-0.352]
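A sketch (ours) of that noise process, flipping each maximal run of 'C' labels independently:

```python
import random
from itertools import groupby

def add_label_noise(labels, p, rng=random):
    """Flip each run of consecutive 'C' hidden-state labels to 'I' with probability p."""
    noisy, idx = list(labels), 0
    for value, run in groupby(labels):
        run_len = len(list(run))
        if value == "C" and rng.random() < p:
            noisy[idx:idx + run_len] = ["I"] * run_len
        idx += run_len
    return noisy

print("".join(add_label_noise(list("CCICCCII"), p=0.5)))
```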

75 Figure 1b indicates the proportion of labels correctly identified by four different methods at varying noise levels: a generative model trained with joint log-likelihood, a CRF trained with conditional log-likelihood, the maximum-margin method of Taskar et al. [sent-178, score-0.31]

76 [9] as implemented in the SVMstruct package [10], and a CRF trained with maximum labelwise accuracy. [sent-179, score-0.461]

77 No method outperforms maximum labelwise accuracy at any noise level. [sent-180, score-0.692]

78 At noise levels of 0.05 and above, maximum labelwise accuracy performs significantly better than the other methods. [sent-182, score-0.644]

79 The maximum margin method performed best when Viterbi decoding was used, while the other three methods had better performance with MEA decoding. [sent-184, score-0.292]

80 Interestingly, with no noise present, maximum margin training with Viterbi decoding performed significantly better than generative training with Viterbi decoding, but this was still much worse than generative training with MEA decoding. [sent-185, score-0.744] [sent-188, score-0.3]

82 Gene prediction experiments. To test the performance of maximum labelwise accuracy training on a large-scale, real-world problem, we trained a CRF to predict protein coding genes in the genome of the fruit fly Drosophila melanogaster. [sent-191, score-0.853]

83 Three separate training runs were performed, using three different objective functions: maximum labelwise (approximate) accuracy, maximum conditional log-likelihood, and maximum pointwise conditional log-likelihood. (Footnote: we were unable to get SVMstruct to converge on our test problem when using the Tsochantaridis et al. formulation.) [sent-196, score-0.45]

84 Figure 2: Panel (a) compares three pointwise loss functions in the special case where a label has two possible values. [sent-247, score-0.361]

85 The green curve, $f(x) = -\log\big(\frac{1-x}{2}\big)$, depicts pointwise logloss; the red curve represents the ideal zero-one loss; and the blue curve gives the sigmoid approximation with parameter 15. [sent-248, score-0.257]
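A small matplotlib sketch (ours) reproducing the three curves, reading the horizontal axis as the posterior margin x by which the incorrect label beats the correct one, so that the correct label's posterior is (1 − x)/2; this axis reading is our assumption from the garbled caption:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-0.99, 0.99, 400)  # assumed axis: P(incorrect) - P(correct)
plt.plot(x, -np.log((1 - x) / 2), "g", label="pointwise logloss")
plt.plot(x, (x > 0).astype(float), "r", label="zero-one loss")
plt.plot(x, 1.0 / (1.0 + np.exp(-15 * x)), "b", label="sigmoid, lambda = 15")
plt.xlabel("posterior margin")
plt.ylabel("loss")
plt.legend()
plt.show()
```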

86 Panels (b), (c), and (d) show gene prediction learning curves using three training objective functions: (b) maximum labelwise (approximate) accuracy, (c) maximum conditional log-likelihood, and (d) maximum pointwise conditional log-likelihood, respectively. [sent-249, score-1.357]

87 Figures 2b, 2c, and 2d show the value of the objective function and the average label accuracy at each iteration of the three training runs. [sent-253, score-0.617]

88 Here, maximum accuracy training improves upon the accuracy of the original generative parameters and outperforms the other two training objectives. [sent-254, score-0.914]

89 In contrast, maximum likelihood training and maximum pointwise likelihood training both give worse performance than the simple generative parameter estimates. [sent-255, score-0.916]

90 Evidently, for this problem the likelihood-based functions are poor surrogate measures for per-label accuracy: Figures 2c and 2d show declines in training and testing set accuracy, despite increases in the objective function. [sent-256, score-0.471]

91 Discussion and related work. In contrast to most previous work describing alternative objective functions for CRFs, the method described in this paper optimizes a direct approximation of the Hamming loss. [sent-257, score-0.203]

92 For binary classifiers, Jansche showed that an algorithm designed to optimize F-measure performance of a logistic regression model for information extraction outperforms maximum likelihood training [14]. [sent-259, score-0.388]

93 After this work was submitted for consideration, a Minimum Classification Error (MCE) method for training CRFs to minimize empirical risk was independently proposed by Suzuki et al. [sent-263, score-0.325]

94 This technique minimizes the loss incurred by maximum a posteriori, rather than maximum expected accuracy, parsing on the training set. [sent-265, score-0.663]

95 In practice, Viterbi parsers often achieve worse per-label accuracy than maximum expected accuracy parsers [3, 4, 5]; we are currently exploring whether a similar relationship also exists between MCE methods and our proposed training objective. [sent-266, score-0.844]

96 The training method described in this work is theoretically attractive, as it addresses the goal of empirical risk minimization in a very direct way. [sent-267, score-0.301]

97 In addition to its theoretical appeal, we have shown that it performs much better than maximum likelihood and maximum pointwise likelihood training on a large-scale, real-world problem. [sent-268, score-0.696]

98 Furthermore, our method is efficient, having time complexity approximately three times that of maximum likelihood training, and easily parallelizable, as each training example can be considered independently when evaluating the objective function or its gradient. [sent-269, score-0.568]
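A sketch of the parallelization claim, mapping per-example contributions across worker processes and summing; `example_objective` is a placeholder for the per-sequence computation, not the authors' code:

```python
from multiprocessing import Pool

def example_objective(args):
    w, example = args
    # Placeholder: compute this example's contribution to R_labelwise (and gradient).
    return 0.0

def total_objective(w, dataset, workers=4):
    """Examples contribute independently, so the objective is a parallel map-sum.
    (On platforms that spawn processes, call under `if __name__ == "__main__":`.)"""
    with Pool(workers) as pool:
        return sum(pool.map(example_objective, [(w, ex) for ex in dataset]))
```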

99 In practice, this nonconvexity can be combated by initializing the optimization with a parameter vector obtained by a convex training method. [sent-271, score-0.202]

100 Investigating loss functions and optimization methods for discriminative learning of label sequences. [sent-327, score-0.245]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('yj', 0.455), ('pw', 0.406), ('labelwise', 0.311), ('accuracy', 0.215), ('crf', 0.211), ('hamming', 0.211), ('pointwise', 0.164), ('training', 0.154), ('objective', 0.154), ('parse', 0.122), ('maximum', 0.118), ('rlabelwise', 0.111), ('wt', 0.099), ('parsing', 0.098), ('conditional', 0.095), ('margin', 0.094), ('label', 0.094), ('crfs', 0.093), ('viterbi', 0.091), ('risk', 0.09), ('logloss', 0.089), ('loss', 0.081), ('decoding', 0.08), ('stanford', 0.079), ('yk', 0.073), ('testing', 0.071), ('likelihood', 0.071), ('surrogate', 0.07), ('flybase', 0.067), ('mea', 0.067), ('recurrences', 0.067), ('labeling', 0.066), ('parses', 0.066), ('labels', 0.063), ('incurred', 0.059), ('empirical', 0.057), ('taskar', 0.05), ('kakade', 0.049), ('optimization', 0.048), ('sequence', 0.045), ('suzuki', 0.044), ('posterior', 0.044), ('gradient', 0.044), ('exp', 0.044), ('tsochantaridis', 0.042), ('incorrect', 0.041), ('svmstruct', 0.039), ('parallelizable', 0.039), ('olga', 0.039), ('nonconvex', 0.039), ('parsers', 0.039), ('generative', 0.037), ('max', 0.037), ('mce', 0.035), ('emnlp', 0.035), ('och', 0.035), ('expected', 0.035), ('probabilities', 0.035), ('predicted', 0.034), ('backward', 0.034), ('candidate', 0.034), ('acl', 0.033), ('rmax', 0.033), ('trained', 0.032), ('log', 0.031), ('ew', 0.031), ('smooth', 0.031), ('gene', 0.03), ('softmax', 0.029), ('worse', 0.029), ('correct', 0.029), ('elds', 0.028), ('posteriori', 0.028), ('approximation', 0.027), ('forward', 0.027), ('usa', 0.027), ('noise', 0.027), ('indicator', 0.026), ('dif', 0.026), ('formulation', 0.024), ('procedures', 0.024), ('incorrectly', 0.024), ('optimize', 0.024), ('et', 0.024), ('simulation', 0.024), ('arg', 0.023), ('genome', 0.023), ('fj', 0.022), ('unseen', 0.022), ('panel', 0.022), ('curve', 0.022), ('functions', 0.022), ('suboptimal', 0.022), ('labeled', 0.022), ('predictive', 0.021), ('posteriors', 0.021), ('base', 0.021), ('outperforms', 0.021), ('culties', 0.021), ('equation', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou

Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise.

2 0.20786308 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan

Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. Conditional Random Fields (CRFs), which directly model the conditional probability P (y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/.

3 0.18218315 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms.

4 0.17103271 117 nips-2006-Learning on Graph with Laplacian Regularization

Author: Rie K. Ando, Tong Zhang

Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classification performance.

5 0.14288706 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

Author: Fei Sha, Lawrence K. Saul

Abstract: We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradient-based methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus.

6 0.12498555 121 nips-2006-Learning to be Bayesian without Supervision

7 0.11535446 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation

8 0.10089382 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

9 0.094331272 122 nips-2006-Learning to parse images of articulated bodies

10 0.085571267 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

11 0.081470065 186 nips-2006-Support Vector Machines on a Budget

12 0.079172038 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

13 0.076553062 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines

14 0.075806528 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow

15 0.074361123 108 nips-2006-Large Scale Hidden Semi-Markov SVMs

16 0.073155865 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

17 0.069881633 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

18 0.067523889 130 nips-2006-Max-margin classification of incomplete data

19 0.066784479 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

20 0.066714592 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.214), (1, 0.068), (2, -0.021), (3, -0.039), (4, -0.031), (5, 0.134), (6, -0.042), (7, -0.079), (8, -0.074), (9, -0.115), (10, -0.169), (11, -0.038), (12, -0.207), (13, 0.127), (14, -0.093), (15, 0.098), (16, 0.105), (17, -0.058), (18, 0.06), (19, -0.078), (20, 0.187), (21, -0.005), (22, -0.144), (23, -0.236), (24, -0.031), (25, 0.013), (26, -0.005), (27, -0.097), (28, 0.097), (29, -0.017), (30, -0.177), (31, 0.04), (32, -0.14), (33, -0.048), (34, -0.017), (35, -0.139), (36, -0.097), (37, -0.004), (38, -0.059), (39, 0.036), (40, 0.102), (41, -0.009), (42, 0.003), (43, 0.084), (44, 0.073), (45, 0.034), (46, 0.042), (47, 0.025), (48, 0.018), (49, -0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95996374 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou

Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise.

2 0.73210907 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan

Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. Conditional Random Fields (CRFs), which directly model the conditional probability P (y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/.

3 0.65117329 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow

Author: Yi Mao, Guy Lebanon

Abstract: We examine the problem of predicting local sentiment flow in documents, and its application to several areas of text analysis. Formally, the problem is stated as predicting an ordinal sequence based on a sequence of word sets. In the spirit of isotonic regression, we develop a variant of conditional random fields that is well-suited to handle this problem. Using the Möbius transform, we express the model as a simple convex optimization problem. Experiments demonstrate the model and its applications to sentiment prediction, style analysis, and text summarization.

4 0.63341141 117 nips-2006-Learning on Graph with Laplacian Regularization

Author: Rie K. Ando, Tong Zhang

Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classification performance.

5 0.53205723 108 nips-2006-Large Scale Hidden Semi-Markov SVMs

Author: Gunnar Rätsch, Sören Sonnenburg

Abstract: We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology.

6 0.51708317 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

7 0.45303744 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

8 0.42426381 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation

9 0.42300469 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

10 0.38620049 170 nips-2006-Robotic Grasping of Novel Objects

11 0.38604295 121 nips-2006-Learning to be Bayesian without Supervision

12 0.34688541 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

13 0.32457486 122 nips-2006-Learning to parse images of articulated bodies

14 0.32036468 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

15 0.31143627 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes

16 0.3103936 47 nips-2006-Boosting Structured Prediction for Imitation Learning

17 0.28836298 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis

18 0.28355241 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

19 0.28258467 156 nips-2006-Ordinal Regression by Extended Binary Classification

20 0.281142 130 nips-2006-Max-margin classification of incomplete data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.122), (3, 0.027), (7, 0.091), (9, 0.032), (12, 0.027), (20, 0.01), (22, 0.113), (44, 0.052), (57, 0.095), (65, 0.074), (69, 0.042), (71, 0.011), (90, 0.051), (94, 0.18)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87406641 21 nips-2006-AdaBoost is Consistent

Author: Peter L. Bartlett, Mikhail Traskin

Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after $n^{\nu}$ iterations (for sample size $n$ and $\nu < 1$) the sequence of risks of the classifiers it produces approaches the Bayes risk if Bayes risk $L^{*} > 0$.

same-paper 2 0.86400795 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou

Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise.

3 0.76004881 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . 
t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. 
For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . 
, λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. 
Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. 
For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). 
Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that $h_f$ errs on an example $(x, y)$ iff $y\, h_f(x) \le 0$. Therefore, the exp-loss function, defined as $\exp(-y\, h_f(x))$, is a smooth upper bound on the zero-one error, which equals 1 if $y\, h_f(x) \le 0$ and 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of $h_f$ over the training set with respect to the variables $\alpha_1, \ldots, \alpha_T$. To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is,

$$\max_{\alpha_1, \ldots, \alpha_T} \; -\frac{1}{m} \sum_{i=1}^{m} \exp\left( -y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \right) . \qquad (14)$$

In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables $\alpha_1, \ldots, \alpha_T$. This view of boosting also allows the hypotheses returned by the weak learner to be general real-valued functions, $h_t : \mathcal{X} \to \mathbb{R}$ (see for instance [15]).

In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, note that boosting algorithms define weights in two different domains: the vectors $w_t \in \mathbb{R}^m$, which assign weights to examples, and the weights $\{\alpha_t : t \in [T]\}$ over weak hypotheses. In the terminology used throughout this paper, the weights $w_t \in \mathbb{R}^m$ are primal vectors while (as we show in the sequel) each weight $\alpha_t$ of the hypothesis $h_t$ is related to a dual vector $\lambda_t$. In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game; thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14).

To derive the primal problem whose Fenchel dual is the problem given in Eq. (14), let us first denote by $v_t$ the vector in $\mathbb{R}^m$ whose $i$'th element is $v_{t,i} = y_i h_t(x_i)$. For all $t$, we set $g_t$ to be the function $g_t(w) = [\langle w, v_t \rangle]_+$. Intuitively, $g_t$ penalizes vectors $w$ which assign large weights to examples that are predicted accurately, that is, those for which $y_i h_t(x_i) > 0$. In particular, if $h_t(x_i) \in \{+1, -1\}$ and $w_t$ is a distribution over the $m$ examples (as is the case in AdaBoost), $g_t(w_t)$ reduces to $1 - 2\epsilon_t$ (see Eq. (13)). In this case, minimizing $g_t$ is equivalent to maximizing the error of the individual hypothesis $h_t$ over the examples. Consider the problem of minimizing $c\, f(w) + \sum_{t=1}^{T} g_t(w)$, where $f(w)$ is the relative entropy given in Example 2 and $c = 1/(2\gamma)$ (see Eq. (13)). To derive its Fenchel dual, we note that $g_t^*(\lambda_t) = 0$ if there exists $\beta_t \in [0, 1]$ such that $\lambda_t = \beta_t v_t$, and otherwise $g_t^*(\lambda_t) = \infty$ (see [16]). In addition, let us define $\alpha_t = 2\gamma \beta_t$. Since our goal is to maximize the dual, we can restrict $\lambda_t$ to take the form $\lambda_t = \beta_t v_t = \frac{\alpha_t}{2\gamma} v_t$, and get that

$$D(\lambda_1, \ldots, \lambda_T) \;=\; -c\, f^*\!\left( -\frac{1}{c} \sum_{t=1}^{T} \beta_t v_t \right) \;=\; -\frac{1}{2\gamma} \log\left( \frac{1}{m} \sum_{i=1}^{m} e^{-\sum_{t=1}^{T} \alpha_t y_i h_t(x_i)} \right) . \qquad (15)$$

Minimizing the exp-loss of the strong hypothesis is therefore the dual of the following primal minimization problem: find a distribution over the examples whose relative entropy to the uniform distribution is as small as possible, while its correlation with each $v_t$ is also as small as possible. Since the correlation of $w$ with $v_t$ is inversely proportional to the error of $h_t$ with respect to $w$, in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis.
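One way to see this primal-dual relationship numerically is to check weak duality directly: for any distribution $w$ and any $\alpha_t \in [0, 2\gamma]$, the value of Eq. (15) can never exceed $c\, f(w) + \sum_t g_t(w)$. The following sketch does exactly that; the random data and the value $\gamma = 0.1$ are arbitrary placeholders, and the code is our illustration rather than anything from the paper.

```python
import numpy as np

def primal_objective(w, V, gamma):
    """c*f(w) + sum_t [<w, v_t>]_+, with f the relative entropy to uniform
    and c = 1/(2*gamma); w must be a probability vector."""
    c = 1.0 / (2.0 * gamma)
    kl_to_uniform = float(np.sum(w * np.log(w * len(w))))
    return c * kl_to_uniform + sum(max(float(np.dot(w, v)), 0.0) for v in V)

def dual_objective(alphas, V, gamma):
    """Eq. (15): -(1/(2*gamma)) * log((1/m) * sum_i exp(-sum_t alpha_t v_{t,i}))."""
    margins = sum(a * v for a, v in zip(alphas, V))
    return -(1.0 / (2.0 * gamma)) * float(np.log(np.mean(np.exp(-margins))))

# Weak duality holds for any feasible pair, so random data suffices to check it.
rng = np.random.default_rng(0)
gamma = 0.1
V = [rng.choice([-1.0, 1.0], size=8) for _ in range(5)]  # v_{t,i} = y_i h_t(x_i)
w = rng.random(8) + 0.1
w /= w.sum()
alphas = rng.uniform(0.0, 2 * gamma, size=5)             # i.e., beta_t in [0, 1]
assert dual_objective(alphas, V, gamma) <= primal_objective(w, V, gamma) + 1e-9
```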
The intuition of finding distributions which, in retrospect, result in large error rates for the individual hypotheses was also alluded to in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game in terms of the parameters $\alpha_t$, where $\alpha_t \in [0, 2\gamma]$, and underscore that in our case $\lambda_t = \frac{\alpha_t}{2\gamma} v_t$. At the beginning of the game the booster sets all dual variables to zero, $\forall t \; \alpha_t = 0$. At trial $t$ of the boosting game, the booster first constructs a primal weight vector $w_t \in \mathbb{R}^m$, which assigns importance weights to the examples in the training set. The primal vector $w_t$ is constructed as in Eq. (6), that is, $w_t = \nabla f^*(\theta_t)$, where $\theta_t = -\sum_{i<t} \alpha_i v_i$. Then, the weak learner responds by presenting the loss function $g_t(w) = [\langle w, v_t \rangle]_+$. Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of $h_t$ is $\{+1, -1\}$, then the update given in Eq. (10) is equivalent to the update $\alpha_t = \min\{2\gamma, \frac{1}{2}\log((1 - \epsilon_t)/\epsilon_t)\}$. We have thus obtained a variant of AdaBoost in which the weights $\alpha_t$ are capped above by $2\gamma$. A disadvantage of this variant is that we need to know the parameter $\gamma$. We note in passing that this limitation can be lifted by a different definition of the functions $g_t$; we omit the details due to lack of space.

To analyze our game of boosting, we note that the conditions given in Lemma 3 hold, and therefore the left-hand side inequality of Lemma 3 tells us that

$$\sum_{t=1}^{T} g_t(w_t) - \frac{1}{2c} \sum_{t=1}^{T} \|\lambda_t\|_\infty^2 \;\le\; D(\lambda_1^{T+1}, \ldots, \lambda_T^{T+1}) \, .$$

The definition of $g_t$ and the weak learnability assumption given in Eq. (13) imply that $\langle w_t, v_t \rangle \ge 2\gamma$ for all $t$. Thus, $g_t(w_t) = \langle w_t, v_t \rangle \ge 2\gamma$, which also implies that $\lambda_t = v_t$. Recall that $v_{t,i} = y_i h_t(x_i)$. Assuming that the range of $h_t$ is $[-1, +1]$, we get that $\|\lambda_t\|_\infty \le 1$. Combining all of the above with the left-hand side inequality of Lemma 3, we get that

$$2 T \gamma - \frac{T}{2c} \;\le\; D(\lambda_1^{T+1}, \ldots, \lambda_T^{T+1}) \, .$$

Using the definition of $D$ (see Eq. (15)), the value $c = 1/(2\gamma)$, and rearranging terms, we recover the original bound for AdaBoost:

$$\frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)} \;\le\; e^{-2\gamma^2 T} \, .$$
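A sketch of the resulting game follows, assuming the entropic complexity function of Example 2, for which $\nabla f^*(\theta)$ is a softmax over $\theta$, and the same black-box weak-learner interface as before. This is our reconstruction under those assumptions, not the authors' implementation.

```python
import numpy as np

def grad_f_star(theta):
    """w = grad f*(theta) for f = relative entropy to uniform: a softmax."""
    e = np.exp(theta - theta.max())       # shift for numerical stability
    return e / e.sum()

def boosting_game(weak_learner, m, gamma, T):
    """Capped-AdaBoost variant of the boosting game. weak_learner(w) returns
    v_t with v_{t,i} = y_i h_t(x_i) in {+1,-1} (assumed black-box interface)."""
    theta = np.zeros(m)                   # theta_t = -sum_{i<t} alpha_i v_i
    alphas, vs = [], []
    for t in range(T):
        w = grad_f_star(theta)            # primal vector: distribution over examples
        v = weak_learner(w)               # weak learner's response
        eps = float(np.dot(w, (1 - v) / 2))    # weighted error, Eq. (13)
        eps = min(max(eps, 1e-12), 1 - 1e-12)
        alpha = min(2 * gamma, 0.5 * np.log((1 - eps) / eps))   # capped update
        theta -= alpha * v
        alphas.append(alpha)
        vs.append(v)
    return alphas, vs                     # h_f is built from alphas and the h_t's
```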
8 Related Work and Discussion

We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms in both the online learning and boosting settings, and it also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is threefold, as we now briefly discuss.

First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed "online convex programming" by Zinkevich [19] and was first presented by Gordon in [9]. There is a voluminous amount of work on unifying approaches for deriving online learning algorithms; we refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically carry over well-known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. We would also like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework by setting $f(w) = \frac{1}{2}\|w\|_2^2$ (a sketch of this special case is given below). In parallel with and independently of this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential-based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis.

Second, the use of generalized Fenchel duality, rather than the Lagrange duality used in [17], enables us to analyze boosting algorithms within the framework. Many authors have derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscore an interesting perspective on both online learning and boosting. We believe that this viewpoint has the potential to yield new algorithms in both domains.

Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17], for two reasons. (i) The use of Lagrange duality in [17] is somewhat restricted, while the notion of generalized Fenchel duality is more appropriate for the general and broader problems we consider in this paper. (ii) The strong convexity property we employ both simplifies the analysis and enables more intuitive conditions in our theorems.

There are various possible extensions of this work that we did not pursue here due to lack of space. For instance, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities.
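For concreteness, here is the special case mentioned above: with $f(w) = \frac{1}{2}\|w\|_2^2$, the link function $\nabla f^*$ is the identity, so the dual-incrementing updates reduce to online gradient descent with step size $1/c$. The sketch is illustrative only; `subgrad_oracle` is an assumed stand-in for the environment's loss functions.

```python
import numpy as np

def online_gradient_play(subgrad_oracle, n, T):
    """With f(w) = 0.5*||w||_2^2, grad f* is the identity, so the primal
    vector is w_t = theta_t = -(1/c) * sum_{j<t} lambda_j: online gradient
    descent with step size 1/c. subgrad_oracle(w, t) returns a subgradient
    of g_t at w (an assumed interface for the environment's losses)."""
    c = np.sqrt(T)                        # the choice of c from Theorem 1
    w = np.zeros(n)                       # theta_1 = 0
    iterates = []
    for t in range(T):
        iterates.append(w.copy())
        lam = subgrad_oracle(w, t)
        w = w - lam / c                   # theta_{t+1} = theta_t - lambda_t / c
    return iterates
```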
References

[1] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002.
[5] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, March 2006.
[6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995.
[7] Y. Freund and R. E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996.
[8] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000.
[9] G. Gordon. Regret bounds for prediction problems. In COLT, 1999.
[10] G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006.
[11] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001.
[12] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3), 2001.
[13] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[14] Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005.
[15] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999.
[16] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. Technical report, The Hebrew University, 2006.
[17] S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006.
[18] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999.
[19] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

4 0.75739837 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

Author: Fei Sha, Lawrence K. Saul

Abstract: We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradient-based methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus.

5 0.7516728 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed recently and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machines to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that limit its usefulness in real-world applications. First, it is computationally expensive and difficult to scale to large datasets, because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origin, which makes it unsuitable for clustering unbalanced datasets. Third, it is sensitive to the choice of kernel function, and requires an external procedure to determine the appropriate values for the kernel parameters. In this paper, we propose a "generalized maximum margin clustering" framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing clustering boundaries that do not pass through the origin. It significantly improves computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

6 0.75126666 20 nips-2006-Active learning for misspecified generalized linear models

7 0.74771726 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

8 0.74521601 65 nips-2006-Denoising and Dimension Reduction in Feature Space

9 0.74464238 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow

10 0.74336028 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

11 0.74034798 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

12 0.73998874 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

13 0.73970032 79 nips-2006-Fast Iterative Kernel PCA

14 0.73968291 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

15 0.73957294 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

16 0.73894608 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

17 0.73846686 131 nips-2006-Mixture Regression for Covariate Shift

18 0.73752344 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

19 0.7370472 203 nips-2006-implicit Online Learning with Kernels

20 0.73641807 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions