jmlr jmlr2007 jmlr2007-27 jmlr2007-27-reference knowledge-graph by maker-knowledge-mining

27 jmlr-2007-Distances between Data Sets Based on Summary Statistics

Source: pdf

Author: Nikolaj Tatti

Abstract: The concepts of similarity and distance are crucial in data mining. We consider the problem of deﬁning the distance between two data sets by comparing summary statistics computed from the data sets. The initial deﬁnition of our distance is based on geometrical notions of certain sets of distributions. We show that this distance can be computed in cubic time and that it has several intuitive properties. We also show that this distance is the unique Mahalanobis distance satisfying certain assumptions. We also demonstrate that if we are dealing with binary data sets, then the distance can be represented naturally by certain parity functions, and that it can be evaluated in linear time. Our empirical tests with real world data show that the distance works well. Keywords: data mining theory, complex data, binary data, itemsets

reference text

Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., 26–28 1993. Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and Aino I. Verkamo. Fast discovery of association rules. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press/The MIT Press, 1996. 153 TATTI Pierre Baldi, Paolo Frasconi, and Padhraic Smyth. Modeling the Internet and the Web. John Wiley & Sons, 2003. Michéle Baseville. Distance measures for signal processing and pattern recognition. Signal Processing, 18(4):349–369, 1989. Toon Calders. Axiomatization and Deduction Rules for the Frequency of Itemsets. PhD thesis, University of Antwerp, Belgium, 2003. Tadeusz Calinski and Jerzy Harabasz. A dendrite method for cluster analysis. Communications in Statistics, 3:1–27, 1974. Gregory Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artiﬁcial Intelligence, 42(2–3):393–405, Mar. 1990. Imre Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, Feb. 1975. David L. Davies and Donald W. Bouldin. A cluster separation measure. IEEE Transactions of Pattern Analysis and Machine Intelligence, 1(2):224–227, April 1979. Thomas Eiter and Heikki Mannila. Distance measures for point sets and their computation. Acta Informatica, 34(2):109–133, 1997. Theodore Hailperin. Best possible inequalities for the probability of a logical function of events. The American Mathematical Monthly, 72(4):343–359, Apr. 1965. David Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. The MIT Press, 2001. Jaakko Hollmén, Jouni K Seppänen, and Heikki Mannila. Mixture models and frequent sets: Combining global and local methods for 0-1 data. In Proceedings of the SIAM Conference on Data Mining (2003), 2003. Solomon Kullback. Information Theory and Statistics. Dover Publications, Inc., 1968. Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady., 10(8):707–710, February 1966. Heikki Mannila, Dmitry Pavlov, and Padhraic Smyth. Prediction with local patterns using crossentropy. In Knowledge Discovery and Data Mining, pages 357–361, 1999. Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2002. Judea Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. 154