jmlr jmlr2011 jmlr2011-16 jmlr2011-16-reference knowledge-graph by maker-knowledge-mining

16 jmlr-2011-Clustering Algorithms for Chains

Source: pdf

Author: Antti Ukkonen

Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a ﬁnite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-speciﬁc scores are either difﬁcult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The ﬁrst one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the signiﬁcance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artiﬁcial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing

reference text

N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. In Proceedings of the 37th ACM Symposium on Theory of Computing, pages 684–693, 2005. E. Alpaydin. Introduction to Machine Learning. The MIT Press, 2004. D Arthur and S Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035, 2007. G. H. Ball and D. J. Hall. A clustering technique for summarizing multivariate data. Behavioral Science, 12:153–155, 1967. A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini. Discovering local structure in gene expression data: the order-preserving submatrix problem. In Proceedings of the Sixth Annual International Conference on Computational Biology, pages 49–57, 2002. P Berkhin. Grouping Multidimensional Data, chapter A Survey of Clustering Data Mining Techniques, pages 25–71. Springer, 2006. 1421 U KKONEN J. Besag and P. Clifford. Generalized Monte Carlo signiﬁcance tests. Biometrika, 76(4):633–642, 1989. J. Besag and P. Clifford. Sequential Monte Carlo p-values. Biometrika, 78(2):301–304, 1991. L. M. Busse, P. Orbanz, and J. M. Buhmann. Cluster analysis of heterogeneous rank data. In Proceedings of the 24th international conference on Machine learning, pages 113–120, 2007. S Cl´ mencon and J Jakubowicz. Kantorovich distances between rankings with applications to rank e ¸ aggregation. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, 2010. A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2):116–140, 2001. D. Coppersmith, L. Fleischer, and A. Rudra. Ordering by weighted number of wins gives a good ranking for weighted tournaments. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 776–782, 2006. D. Critchlow. Metric Methods for Analyzing Partially Ranked Data, volume 34 of Lecture Notes in Statistics. Springer-Verlag, 1985. R. O. Duda and P. E. Hart. Pattern Classiﬁcation and Scene Analysis. John Wiley & Sons, 1973. C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the 10th International World Wide Web Conference, 2001. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Texts in Statistical Science. Chapman & Hall, CRC, 2004. A. Gionis, H. Mannila, T. Mielik¨ inen, and P. Tsaparas. Assessing data mining results via swap a randomization. ACM Transactions on Knowledge Discovery from Data, 1(3), 2007. P I Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, volume 2 of Springer series in statistics. Springer, 2000. R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 2nd edition, 1994. D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. The MIT Press, 2001. T. Kamishima and S. Akaho. Efﬁcient clustering for orders. In Workshops Proceedings of the 6th IEEE International Conference on Data Mining, pages 274–278, 2006. T. Kamishima and S. Akaho. Mining Complex Data, volume 165 of Studies in Computational Intelligence, chapter Efﬁcient Clustering for Orders, pages 261–279. Springer, 2009. T. Kamishima and J. Fujiki. Clustering orders. In Proceedings of the 6th International Conference on Discovery Science, pages 194–207, 2003. P Kidwell, G Lebanon, and W S Cleveland. Visualizing incomplete and partially ranked data. IEEE Trans. Vis. Comput. Graph., 14(6):1356–1363, 2008. 1422 C LUSTERING A LGORITHMS FOR C HAINS H Lawrence and A Phipps. Comparing partitions. Journal of Classiﬁcation, 2:193–218, 1985. S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2): 129–137, 1982. H. Moulin. Axioms of Cooperative Decision Making. Cambride Universiy Press, 1991. T. B. Murphy and D. Martin. Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41:645–655, 2003. W M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971. R. Shamir and D. Tsur. Improved algorithms for the random cluster graph model. In Proceedings of Scandanavian Workshop on Algorithms Theory, pages 230–239, 2002. N. Tideman. The single transferable vote. Journal of Economic Perspectives, 9(1):27–38, 1995. A. Ukkonen. Visualizing sets of partial rankings. In Advances in Intelligent Data Analysis VII, pages 240–251, 2007. A. Ukkonen. Algorithms for Finding Orders and Analyzing Sets of Chains. PhD thesis, Helsinki University of Technology, 2008. A. Ukkonen and H. Mannila. Finding outlying items in sets of partial rankings. In Knowledge Discovery in Databases: PKDD 2007, pages 265–276, 2007. R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005. 1423