jmlr jmlr2006 jmlr2006-78 jmlr2006-78-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sharlee Climer, Weixiong Zhang
Abstract: Given a matrix of values in which the rows correspond to objects and the columns correspond to features of the objects, rearrangement clustering is the problem of rearranging the rows of the matrix such that the sum of the similarities between adjacent rows is maximized. Referred to by various names and reinvented several times, this clustering technique has been extensively used in many fields over the last three decades. In this paper, we point out two critical pitfalls that have been previously overlooked. The first pitfall is deleterious when rearrangement clustering is applied to objects that form natural clusters. The second concerns a similarity metric that is commonly used. We present an algorithm that overcomes these pitfalls. This algorithm is based on a variation of the Traveling Salesman Problem. It offers an extra benefit as it automatically determines cluster boundaries. Using this algorithm, we optimally solve four benchmark problems and a 2,467-gene expression data clustering problem. As expected, our new algorithm identifies better clusters than those found by previous approaches in all five cases. Overall, our results demonstrate the benefits of rectifying the pitfalls and exemplify the usefulness of this clustering technique. Our code is available at our websites. Keywords: clustering, visualization of patterns in data, bond energy algorithm, traveling salesman problem, asymmetric clustering
C. J. Alpert. Multi-way Graph and Hypergraph Partitioning. PhD thesis, UCLA, Los Angeles, CA, 1996. C. J. Alpert and A. B. Kahng. Splitting an ordering into a partition to minimize diameter. Journal of Classification, 14:51–74, 1997. D. Applegate, R. Bixby, V. Chv´ tal, and W. Cook. TSP cuts which do not conform to the template a paradigm. In M. Junger and D. Naddef, editors, Computational Combinatorial Optimization, pages 261–304. Springer, 2001. P. Arabie and L. J. Hubert. The bond energy algorithm revisited. IEEE Trans. Systems, Man, and Cybernetics, 20(1):268–74, 1990. P. Arabie, S. Schleutermann, J. Daws, and L. Hubert. Marketing applications of sequencing and partitioning of nonsymmetric and/or two-mode matrices. In W. Gaul and M. Schader, editors, Data Analysis, Decision Support, and Expert Knowledge Representation in Marketing, pages 215–24. Springer Verlag, 1988. S. Arora. Approximation algorithms for geometric TSP. In G. Gutin and A. Punnen, editors, The Traveling Salesman Problem and its Variations. Kluwer Academic, Norwell, MA, 2002. P. Baldi and G. W. Hatfield. DNA Microarrays and Gene Expression. Cambridge University Press, 2002. G. Carpaneto, M. Dell’Amico, and P. Toth. Exact solution of large-scale, asymmetric Traveling Salesman Problems. ACM Trans. on Mathematical Software, 21:394–409, 1995. H. M. Chan and D. A. Milner. Direct clustering algorithm for group formation in cellular manufacturing. Journal of Manufacturing Systems, 1:65–74, 1982. S. Chu, J. L. DeRisi, M. B. Eisen, J. Mulholland, D. Botstein, P. O. Brown, and I. Herskowitz. The transcriptional program of sporulation in budding yeast. Science, 282:699–705, 1998. S. Climer and W. Zhang. Cut-and-solve: An iterative search strategy for combinatorial optimization problems. Artificial Intelligence, 170:714–738, June 2006. S. Climer and W. Zhang. Take a walk and cluster genes: A TSP-based approach to optimal rearrangement clustering. In 21st International Conference on Machine Learning (ICML’04), pages 169–176, Banff, Canada, July 2004. W. Cook. Traveling Salesman Problem. http://www.tsp.gatech.edu, web. 940 R EARRANGEMENT C LUSTERING J. L. DeRisi, V. R. Iyer, and P. O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680–686, 1997. M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice-Hall, Upper Saddle River, NJ, 2003. M. B. Eisen, P. T. Spellman, P.O Brown, and D. Botstein. Cluster analysis and display of genomewide expression patterns. Proc. of the Natl. Acad. of Sciences, 95(25):14863–8, 1998. M. Fischetti, A. Lodi, and P. Toth. Exact methods for the Asymmetric Traveling Salesman Problem. In G. Gutin and A. Punnen, editors, The Traveling Salesman Problem and its Variations. Kluwer Academic, Norwell, MA, 2002. N. Gorla and K. Zhang. Deriving program physical structures using bond energy algorithm. In Proc. 6th Asia Pacific Software Engineering Conference, 1999. G. Gutin and A. P. Punnen. The Traveling Salesman Problem and its variations. Kluwer Academic Publishers, Norwell, MA, 2002. H. A. Hoffer and D. G. Severance. The use of cluster analysis in physical data base design. In Proceedings of the 1st International Conference on Very Large Data Bases, pages 69–86, September 1975. D. S. Johnson. 8th DIMACS implementation challenge. http://www.research.att.com/ ˜dsj/chtsp/, web. D. S. Johnson and L. A. McGeoch. Experimental analysis of heuristics for the STSP. In G. Gutin and A. Punnen, editors, The Traveling Salesman Problem and its Variations, pages 369–443. Kluwer Academic, Norwell, MA, 2002. D. S. Johnson, G. Gutin, L. A. McGeoch, A. Yeo, W. Zhang, and A. Zverovich. Experimental analysis of heuristics for the ATSP. In G. Gutin and A. Punnen, editors, The Traveling Salesman Problem and its Variations. Kluwer Academic, Norwell, MA, 2002. D. S. Johnson, S. Krishnan, J. Chhugani, S. Kumar, and S. Venkatasubramanian. Compressing large boolean matrices using reordering techniques. In 30th Int. Conf. on Very Large Databases (VLDB), pages 13–23, 2004. R. Jonker and T. Volgenant. Transforming asymmetric into symmetric traveling salesman problems. Operations Research Letters, 2:161–163, 1983. R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Plenum Press, New York, 1972. J. R. King. Machine-component grouping in production flow analysis: an approach using a rank order clustering algorithm. International Journal of Production Research, 18(2):213–32, 1980. A. Kusiak. Analysis of integer programming formulations of clustering problems. Image and Vision Computing, 2:35–40, 1984. 941 C LIMER AND Z HANG A. Kusiak. Flexible manufacturing systems: A structural approach. Int. J. Production Res., 23: 1057–73, 1985. E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys. The Traveling Salesman Problem. John Wiley & Sons, Essex, England, 1985. J. K. Lenstra. Clustering a data array and the Traveling Salesman Problem. Operations Research, 22(2):413–4, 1974. J. K. Lenstra and A. H. G. Rinnooy Kan. Some simple applications of the Travelling Salesman Problem. Operational Research Quarterly, 26(4):717–733, 1975. A. Lim, B. Rodrigues, and X. Zhang. Metaheuristics with local search techniques for retail shelfspace optimization. Informs, 50(1):117 –131, 2004. S. Lin and B. W. Kernighan. An effective heuristic algorithm for the traveling salesman problem. Operations Research, 21:498–516, 1973. Y. Liu, B. J. Ciliax, A. Pivoshenko, J. Civera, V. Dasigi, A. Ram, R. Dingledine, and S. B. Navathe. Evaluation of a new algorithm for keyword-based functional clustering of genes. In 8th International Conf. on Research in Computational Molecular Biology (RECOMB-04), San Diego, CA, March 2004. Poster paper. A. Lodi and A. P. Punnen. TSP software. In G. Gutin and A. Punnen, editors, The Traveling Salesman Problem and its Variations. Kluwer Academic, Norwell, MA, 2002. S. T. March. Techniques for structuring data base records. Computing Surveys, 15:45–79, 1983. F. Marcotorchino. Block seriation problems: A unified approach. Appl. Stochastic Models and Data Analysis, 3:73–91, 1987. W. T. McCormick, P. J. Schweitzer, and T. W. White. Problem decomposition and data reorganization by a clustering technique. Operations Research, 20:993–1009, 1972. P. Moscato. TSPBIB. http://www.densis.fee.unicamp .br/˜moscato/TSPBIB_home.html, web. S. Navathe, S. Ceri, G. Wiederhold, and J. Dou. Vertical partitioning of algorithms for database design. ACM Trans. Database Syst., 9(4):680–710, 1984. M. T. Ozsu and P. Valduriez. Principles of distributed database systems. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1999. A. P. Punnen. The Traveling Salesman Problem: applications, formulations, and variations. In G. Gutin and A. Punnen, editors, The Traveling Salesman Problem and its Variations. Kluwer Academic, Norwell, MA, 2002. P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12): 3273–97, 1998. 942 R EARRANGEMENT C LUSTERING M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. Technical Report 00-034, University of Minnesota, 2000. R. Torres-Velzquez and V. Estivill-Castro. Local search for hamiltonian path with applications to clustering visitation paths. Journal of the Operational Research Society, 55(7):737–748, 2004. 943