nips nips2002 nips2002-98 nips2002-98-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods. 1
[1] A.KJain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264- 323, 1999.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Bioi., 215:403 - 410, 1990.
[3] F. Corpet, F. Servant, J. Gouzy, and D. Kahn. Prodom and prodom-cg: tools for protein domain analysis and whole genome comparisons. Nucleid Acids Res., 28:267269, 2000.
[4] T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman & Hall, London, 2001.
[5] R.O. Duda, P.E.Hart, and D.G.Stork. Pattern classification. John Wiley & Sons, second edition, 2001.
[6] P. J. Huber. Projection pursuit. The Annals of Statistics, pages 435--475, 1985.
[7] H. Kasai, A. Bairoch, K Watanabe, K Isono, and S. Harayama. Construction of the gyrb database for the identification and classification of bacteria. Genome Informatics, pages 13 - 21, 1998.
[8] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 1995.
[9] S. Mika, B. SchOlkopf, A.J. Smola, K-R. Miiller, M. Scholz, and G. Ratsch. Kernel PCA and de- noising in feature spaces. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 536542. MIT Press, 1999.
[10] A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. Scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Bioi., 247:536- 540, 1995.
[11] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence analysis. Proc. Natl. Acad. Sci, 85:2444 - 2448, 1988.
[12] J. Puzicha, T. Hofmann, and J. Buhmann. A theory of proximity based clustering: Structure detection by optimization. Pattern Recognition, 33(4):617- 634, 1999.
[13] V. Roth, M. Braun, T. Lange, and J. Buhmann. A resampling approach to cluster validation. In Computational Statistics-COMPSTAT'02, 2002. To appear.
[14] V. Roth, J. Laub, M. Kawanabe, and J.M. Buhmann. Optimal cluster preserving embedding of non-metric proximity data. Technical Report IAI-TR-2002-5, University of Bonn, 2002.
[15] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[16] B. Schiilkopf, A. Smola, and K-R. Miiller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299- 1319, 1998.
[17] J.B. Tenenbaum, V. Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319- 2323, 2000.
[18] K Tsuda, T. Kin, and K Asai. Marginalized kernels for biological sequences. Proc. ISMB, to appear:2002 , http://www.cbrc.jp/ tsuda/.
[19] K Watanabe, J. Nelson, S. Harayama, and H. Kasai. Icb database: the gyrb database for identification and classification of bacteria. Nucleic Acids Res., 29:344 - 345, 2001.