jmlr jmlr2010 jmlr2010-7 jmlr2010-7-reference knowledge-graph by maker-knowledge-mining

7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm

Source: pdf

Author: Yael Ben-Haim, Elad Tom-Tov

Abstract: We propose a new algorithm for building decision tree classiﬁers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classiﬁer, while being scalable for processing of streaming data on multiple processors. These ﬁndings are supported by a rigorous analysis of the algorithm’s accuracy. The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a ﬁxed amount of memory. A master processor uses this information to ﬁnd near-optimal split points to terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the overall tree accuracy. Keywords: decision tree classiﬁers, distributed computing, streaming data, scalability

reference text

Rakesh Agrawal and Arun Swami. A one-pass space-efﬁcient algorithm for ﬁnding quantiles. In Proceedings of COMAD, Pune, India, 1995. Khaled AlSabti, Sanjay Ranka, and Vineet Singh. CLOUDS: Classiﬁcation for large or out-of-core datasets. In Conference on Knowledge Discovery and Data Mining, August 1998. 869 B EN -H AIM AND YOM -T OV (a) (b) (c) (d) Figure 5: An example of an execution of the merge procedure. Figure 6: The sum procedure. 870 A S TREAMING PARALLEL D ECISION T REE A LGORITHM Nuno Amado, Joao Gama, and Fernando Silva. Parallel implementation of decision tree learning algorithms. In The 10th Portuguese Conference on Artiﬁcial Intelligence on Progress in Artiﬁcial Intelligence, Knowledge Extraction, Multi-agent Systems, Logic Programming and Constraint Solving, pages 6–13, December 2001. Catherine L. Blake, Eamonn J. Keogh, and Christopher J. Merz. UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences, 1998. http://www.ics.uci.edu/∼mlearn/ MLRepository.html. Leon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20. MIT Press, Cambridge, MA, 2008. URL http://leon.bottou.org/papers/bottou-bousquet-2008. to appear. Leo Breiman, Jerome H. Friedman, Richard Olshen, and Charles J. Stone. Classiﬁcation and Regression Trees. Wadsworth, Monterrey, CA, 1984. Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005. Janez Dem˘ar. Statistical comparisons of classiﬁers over multiple data sets. Journal of Machine s Learning Research, 7:1–30, 2006. Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh. BOAT — optimistic decision tree construction. In ACM SIGMOD International Conference on Management of Data, pages 169–180, June 1999. Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. How to summarize the universe: dynamic maintenance of quantiles. In Proceedings of the 28th VLDB Conference, pages 454–465, 2002. Sanjay Goil and Alok Choudhary. Efﬁcient parallel classiﬁcation using dimensional aggregates. In Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 197–210, August 1999. Michael Greenwald and Sanjeev Khanna. Space-efﬁcient online computation of quantile summaries. In Proceedings of ACM SIGMOD, Santa Barbara, California, pages 58–66, ’may’ 2001. Isaac D. Guedalia, Mickey London, and Michael Werman. An on-line agglomerative clustering method for nonstationary data. Neural Comp., 11(2):521–540, 1999. Sudipto Guha, Nick Koudas, and Kyuseok Shim. Approximation and streaming algorithms for histogram construction problems. ACM Trans. on Database Systems, 31(1):396–438, ’mar’ 2006. Yannis E. Ioannidis. The history of histograms (abridged). In Proceedings of the VLDB Conference, pages 19–30, 2003. Raj Jain and Imrich Chlamtac. The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations. Communications of the ACM, 28(10):1076–1085, ’oct’ 1985. 871 B EN -H AIM AND YOM -T OV Ruoming Jin and Gagan Agrawal. Communication and memory efﬁcient parallel decision tree construction. In The 3rd SIAM International Conference on Data Mining, May 2003. Mahesh V. Joshi, George Karypis, and Vipin Kumar. ScalParC: A new scalable and efﬁcient parallel classiﬁcation algorithm for mining large datasets. In The 12th International Parallel Processing Symposium, pages 573–579, March 1998. Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning algorithms. Journal of Computer and System Sciences, 58(1):109–128, ’feb’ 1999. Xuemin Lin. Continuously maintaining order statistics over data streams. In Proceedings of the 18th Australian Database Conference, Ballarat, Victoria, Australia, 2007. Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of ACM SIGMOD, Seattle, WA, USA, pages 426–435, 1998. Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. SLIQ: A fast scalable classiﬁer for data mining. In The 5th International Conference on Extending Database Technology, pages 18–32, 1996. Girija J. Narlikar. A parallel, multithreaded decision tree builder. Technical Report CMU-CS-98184, Carnegie Mellon University, 1998. Pascal, 2008. Pascal large scale learning challenge, http://largescale.first.fraunhofer.de, datasets can be downloaded http://ftp.first.fraunhofer.de/pub/projects/largescale. 2008. from PML. IBM Parallel Machine http://www.alphaworks.ibm.com/tech/pml. 2009. Learning Toolbox, John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classiﬁer for data mining. In The 22nd International Conference on Very Large Databases, pages 544–555, September 1996. Mahesh K. Sreenivas, Khaled Alsabti, and Sanjay Ranka. Parallel out-of-core divide-and-conquer techniques with applications to classiﬁcation trees. In The 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pages 555– 562, 1999. Available as preprint in http://ipdps.cc.gatech.edu/1999/papers/207.pdf. Anurag Srivastava, Eui-Hong Han, Vipin Kumar, , and Vineet Singh. Parallel formulations of decision-tree classiﬁcation algorithms. Data Mining and Knowledge Discovery, 3(3):237–261, September 1999. 872