nips nips2000 nips2000-74 nips2000-74-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
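The expansion step described in the abstract can be illustrated with a short sketch: each input is mapped to the vector of smoothed conditionals P(i|x) over all labeled and unlabeled examples. This is a minimal sketch, assuming a Gaussian kernel with bandwidth sigma; the function name and these choices are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def kernel_expansion(X_all, X_query, sigma=1.0):
    """Represent each query point by the vector (P(1|x), ..., P(N|x)) of
    normalized kernel weights over all N labeled + unlabeled examples.
    The Gaussian kernel and bandwidth sigma are illustrative assumptions."""
    # Squared distances between every query point and every expansion point.
    d2 = ((X_query[:, None, :] - X_all[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    # Normalize each row so it defines a distribution P(i | x) over the N examples.
    return K / K.sum(axis=1, keepdims=True)
```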
[1] Nigam K., McCallum A., Thrun S., and Mitchell T. (2000) Text classification from labeled and unlabeled examples. Machine Learning 39 (2):103-134.
[2] Blum A., Mitchell T. (1998) Combining Labeled and Unlabeled Data with Co-Training. In Proc. 11th Annual Conf. on Computational Learning Theory, pp. 92-100.
[3] Vapnik V. (1998) Statistical learning theory. John Wiley & Sons.
[4] Joachims T. (1999) Transductive inference for text classification using support vector machines. International Conference on Machine Learning.
[5] Jaakkola T., Meila M., and Jebara T. (1999) Maximum entropy discrimination. In Advances in Neural Information Processing Systems 12.
[6] Hofmann T., Puzicha J. (1998) Unsupervised Learning from Dyadic Data. International Computer Science Institute, TR-98-042.
[7] Tong S., Koller D. (2000) Restricted Bayes Optimal Classifiers. Proceedings AAAI.
[8] Miller D., Uyar T. (1996) A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data. In Advances in Neural Information Processing Systems 9, pp. 571-577.
[9] Castelli V., Cover T. (1996) The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory 42 (6):2102-2117.

A Maximum entropy solution

The unique solution to the maximum entropy estimation problem is found by introducing Lagrange multipliers $\{\lambda_l\}$ for the classification constraints. The multipliers satisfy $\lambda_l \in [0, c]$, where the lower bound comes from the inequality constraints and the upper bound from the linear margin penalties being minimized. To represent the solution and find the optimal setting of the $\lambda_l$ we must evaluate the partition function

$$ Z(\lambda) = e^{-\gamma \sum_l \lambda_l} \sum_{y_1,\ldots,y_N} \prod_{i=1}^{N} e^{\, y_i \sum_l y_l \lambda_l P(i|x_l)} \qquad (5) $$
$$ \phantom{Z(\lambda)} = e^{-\gamma \sum_l \lambda_l} \prod_{i=1}^{N} \left( e^{\sum_l y_l \lambda_l P(i|x_l)} + e^{-\sum_l y_l \lambda_l P(i|x_l)} \right) \qquad (6) $$

that normalizes the maximum entropy distribution. Here the $y_l$ denote the observed labels. Minimizing the jointly convex log-partition function $\log Z(\lambda)$ with respect to the Lagrange multipliers leads to the optimal setting $\{\lambda_l^*\}$. This optimization is readily done via an axis-parallel line search (e.g. the bisection method). The required gradients are given by

$$ \frac{\partial \log Z(\lambda)}{\partial \lambda_k} = -\gamma + y_k \sum_{i=1}^{N} \tanh\!\left( \sum_l y_l \lambda_l P(i|x_l) \right) P(i|x_k) \qquad (7) $$
$$ \phantom{\frac{\partial \log Z(\lambda)}{\partial \lambda_k}} = -\gamma + y_k \sum_{i=1}^{N} E_{P_i^*}\{y_i\}\, P(i|x_k) \qquad (8) $$

(this is essentially the classification constraint). The expectation is taken with respect to the maximum entropy distribution $P^*(y_1, \ldots, y_N) = P_1^*(y_1) \cdots P_N^*(y_N)$, whose components are $P_i^*(y_i) \propto \exp\{\sum_l y_l \lambda_l\, y_i P(i|x_l)\}$. The label averages $w_i = E_{P_i^*}\{y_i\} = \sum_{y_i} y_i P_i^*(y_i)$ are needed for the decision rule as well as in the optimization. We can identify them from the above as $w_i = \tanh(\sum_l y_l \lambda_l P(i|x_l))$, and they are readily evaluated. Finding the solution involves $O(L^2 N)$ operations.

Often the numbers of positive and negative training labels are imbalanced. The MED formulation (analogously to SVMs) can be adjusted by defining the margin penalties as $c^+ \sum_{l: y_l = 1} \xi_l + c^- \sum_{l: y_l = -1} \xi_l$, where, for example, $L^+ c^+ = L^- c^-$ equalizes the mean penalties. The coefficients $c^+$ and $c^-$ can also be modified adaptively during the estimation process to balance the rate of misclassification errors across the two classes.
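The axis-parallel optimization above is straightforward to prototype. The following is a minimal sketch, assuming the smoothed conditionals P[l, i] = P(i|x_l) for the L labeled points are given (for instance from the expansion sketched after the abstract); the function name solve_med, the defaults for gamma and c, and the fixed sweep and bisection counts are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def solve_med(P, y, gamma=0.1, c=1.0, sweeps=100):
    """Minimize log Z(lambda) over [0, c]^L by axis-parallel bisection.

    P[l, i] = P(i | x_l) for labeled point l over all N expansion points,
    y[l] in {-1, +1} are the observed labels, gamma is the margin,
    c is the margin-penalty bound on each multiplier.
    """
    L, N = P.shape
    lam = np.zeros(L)

    def grad(k, val):
        # d log Z / d lambda_k with lambda_k set to `val` (eqs. 7-8).
        trial = lam.copy()
        trial[k] = val
        w = np.tanh((y * trial) @ P)       # w_i = tanh(sum_l y_l lambda_l P(i|x_l))
        return -gamma + y[k] * (w @ P[k])

    for _ in range(sweeps):
        for k in range(L):
            lo, hi = 0.0, c
            if grad(k, lo) >= 0:           # gradient nonnegative on [0, c]: multiplier stays at 0
                lam[k] = lo
            elif grad(k, hi) <= 0:         # gradient nonpositive on [0, c]: multiplier hits the bound c
                lam[k] = hi
            else:                          # bisect for the zero of the monotone gradient
                for _ in range(40):
                    mid = 0.5 * (lo + hi)
                    if grad(k, mid) < 0:
                        lo = mid
                    else:
                        hi = mid
                lam[k] = 0.5 * (lo + hi)

    # Label averages w_i = E[y_i], used by the decision rule.
    w = np.tanh((y * lam) @ P)
    return lam, w
```

Under the same assumptions, a new point x would then be classified by the sign of the weighted expansion sum_i w_i P(i|x), using the label averages returned above.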