
NIPS 2004, Paper 37: Co-Training and Expansion: Towards Bridging Theory and Practice


Source: pdf

Author: Maria-Florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative co-training to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs.
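The iterative algorithm the abstract refers to is the bootstrapping scheme of Blum and Mitchell [2]: train one classifier per view on the currently labeled pool, let each classifier label the unlabeled examples it is most confident about, and repeat. The sketch below is a minimal illustration of that loop, assuming scikit-learn-style classifiers; the choice of GaussianNB, the pool-growth rate, and all function and parameter names are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of iterative co-training in the style of Blum & Mitchell [2].
# GaussianNB, the pool sizes, and the confidence heuristic are illustrative choices.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, labeled_idx, rounds=10, grow_per_round=5):
    """Grow the labeled pool using two views X1 (n, d1) and X2 (n, d2).
    True labels y are consulted only for the initially labeled examples."""
    pseudo = {int(i): y[i] for i in labeled_idx}      # index -> (pseudo-)label
    h1, h2 = GaussianNB(), GaussianNB()

    for _ in range(rounds):
        idx = np.array(sorted(pseudo))
        lab = np.array([pseudo[i] for i in idx])
        h1.fit(X1[idx], lab)                          # one classifier per view
        h2.fit(X2[idx], lab)

        unlabeled = np.array([i for i in range(len(y)) if i not in pseudo])
        if len(unlabeled) == 0:
            break
        # Each view adds the unlabeled examples it is most confident about,
        # labeled with its own prediction, to the shared pool.
        for h, X in ((h1, X1), (h2, X2)):
            conf = h.predict_proba(X[unlabeled]).max(axis=1)
            best = unlabeled[np.argsort(-conf)[:grow_per_round]]
            for i in best:
                pseudo.setdefault(int(i), h.predict(X[i:i + 1])[0])
    return h1, h2
```

Each pass through the loop corresponds to one round of the iterative process whose progress the paper's expansion assumption is meant to guarantee.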


reference text

[1] S. Abney. Bootstrapping. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 360–367, 2002.

[2] A. Blum and T. M. Mitchell. Combining labeled and unlabeled data with co-training. In Proc. 11th Annual Conference on Computational Learning Theory, pages 92–100, 1998.

[3] M. Collins and Y. Singer. Unsupervised models for named entity classification. In SIGDAT Conf. Empirical Methods in NLP and Very Large Corpora, pages 189–196, 1999.

[4] S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. In Advances in Neural Information Processing Systems 14. MIT Press, 2001.

[5] R. Ghani. Combining labeled and unlabeled data for text classification with a large number of categories. In Proceedings of the IEEE International Conference on Data Mining, 2001.

[6] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pages 200–209, 1999.

[7] M. Kearns, M. Li, and L. Valiant. Learning Boolean formulae. JACM, 41(6):1298–1328, 1995.

[8] A. Levin, P. Viola, and Y. Freund. Unsupervised improvement of visual detectors using co-training. In Proc. 9th IEEE International Conf. on Computer Vision, pages 626–633, 2003.

[9] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[10] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In Proc. ACM CIKM Int. Conf. on Information and Knowledge Management, pages 86–93, 2000.

[11] K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.

[12] S. Park and B. Zhang. Large scale unstructured document classification using unlabeled data and syntactic information. In PAKDD 2003, LNCS vol. 2637, pages 88–99. Springer, 2003.

[13] D. Pierce and C. Cardie. Limitations of Co-Training for natural language learning from large datasets. In Proc. Conference on Empirical Methods in NLP, pages 1–9, 2001.

[14] R. Rivest and R. Sloan. Learning complicated concepts reliably and usefully. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 69–79, 1988.

[15] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Meeting of the Association for Computational Linguistics, pages 189–196, 1995.

[16] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proc. 20th International Conf. Machine Learning, pages 912–912, 2003. A Relating the definitions We show here how Definition 2 implies Definition 1. Theorem 2 If D+ satisfies -left-right expansion (Definition 2), then it also satisfies (Definition 1) for = /(1 + ). -expansion + + Proof: We will prove the contrapositive. Suppose there exist S1 ⊆ X1 , S2 ⊆ X2 such that min Pr(S1 ∧ S2 ), Pr(S1 ∧ S2 ) . Assume without loss of generality that Pr(S1 ⊕ S2 ) < Pr(S1 ∧ S2 ) ≤ Pr(S1 ∧ S2 ). Since Pr(S1 ∧ S2 ) + Pr(S1 ∧ S2 ) + Pr(S1 ⊕ S2 ) = 1 it follows that Pr(S1 ∧ S2 ) ≤ 1 − Pr(S1 ⊕S2 ) . Assume Pr(S1 ) ≤ Pr(S2 ). This implies that Pr(S1 ) ≤ 1 2 2 2 since Pr(S1 )+Pr(S2 ) = 2 Pr(S1 ∧S2 )+Pr(S1 ⊕S2 ) and so Pr(S1 ) ≤ Pr(S1 ∧S2 )+ Pr(S1 ⊕S2 ) . 2 Now notice that Pr(S1 ∧ S2 ) Pr(S1 ∧ S2 ) 1 ≥ > Pr(S2 |S1 ) = ≥1− . Pr(S1 ) Pr(S1 ∧ S2 ) + Pr(S1 ⊕ S2 ) 1+ But Pr(S2 ) ≤ Pr(S1 ∧ S2 ) + Pr(S1 ⊕ S2 ) < (1 + ) Pr(S1 ∧ S2 ) ≤ (1 + ) Pr(S1 ) and so Pr(S2 ) < (1 + ) Pr(S1 ). Similarly if Pr(S2 ) ≤ Pr(S1 ) we get a failure of expansion in the other direction. This completes the proof.