jmlr jmlr2013 jmlr2013-5 jmlr2013-5-reference knowledge-graph by maker-knowledge-mining

5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components


Source: pdf

Author: Kamalika Chaudhuri, Anand D. Sarwate, Kaushik Sinha

Abstract: The principal components analysis (PCA) algorithm is a standard tool for identifying good lowdimensional approximations to high-dimensional data. Many data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We show that the sample complexity of the proposed method differs from the existing procedure in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. We furthermore illustrate our results, showing that on real data there is a large performance gap between the existing method and our method. Keywords: differential privacy, principal components analysis, dimension reduction


reference text

Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. SIGMOD Record, 29(2):439–450, 2000. ISSN 0163-5808. doi: 10.1145/335191.335438. URL http://dx.doi. org/10.1145/335191.335438. Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013. URL http://archive. ics.uci.edu/ml. Keith M. Ball. An elementary introduction to modern convex geometry. In S. Levy, editor, Flavors of Geometry, volume 31 of Mathematical Sciences Research Institute Publications, pages 1–58. Cambridge University Press, 1997. Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS ’07), pages 273–282, New York, NY, USA, 2007. ACM. doi: 10.1145/1265530.1265569. URL http://dx.doi.org/10.1145/1265530.1265569. Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The Johnson-Lindenstrauss Transform itself preserves differential privacy. In IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 410–419, October 2012. doi: 10.1109/FOCS.2012.67. URL http://dx.doi.org/10.1109/FOCS.2012.67. Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS ’05), pages 128–138, New York, NY, USA, 2005. ACM. doi: 10.1145/1065167.1065184. URL http://dx.doi.org/10.1145/1065167. 1065184. Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to non-interactive database privacy. In R. E. Ladner and C. Dwork, editors, Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC ’08), pages 609–618, New York, NY, USA, 2008. ACM. doi: 10.1145/1374376.1374464. URL http://dx.doi.org/10.1145/1374376. 1374464. Stephen P. Brooks. Markov chain Monte Carlo method and its application. Journal of the Royal Statistical Society. Series D (The Statistician), 47(1):69–100, April 1998. ISSN 00390526. doi: 10.1111/1467-9884.00117. URL http://dx.doi.org/10.1111/1467-9884.00117. Stephen P. Brooks and Andrew Gelman. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455, December 1998. doi: 10.2307/1390675. URL http://dx.doi.org/10.2307/1390675. Stephen P. Brooks and Gareth O. Roberts. Convergence assessment techniques for Markov chain Monte Carlo. Statistics and Computing, 8(4):319–335, December 1998. doi: 10.1023/A: 1008820505350. URL http://dx.doi.org/10.1023/A:1008820505350. 2937 C HAUDHURI , S ARWATE AND S INHA Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/˜cjlin/libsvm. Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning. In Sham Kakade and Ulrike von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory (COLT ’11), volume 19 of JMLR Workshop and Conference Proceedings, pages 155–186, Budapest, Hungary, June 2011. URL http://www.jmlr.org/proceedings/ papers/v19/chaudhuri11a/chaudhuri11a.pdf. Kamalika Chaudhuri and Daniel Hsu. Convergence rates for differentially private statistical estimation. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML ’12, pages 1327–1334, New York, NY, USA, July 2012. Omnipress. URL http://icml.cc/2012/papers/663.pdf. Kamalika Chaudhuri and Nina Mishra. When random sampling preserves privacy. In Cynthia Dwork, editor, Advances in Cryptology - CRYPTO 2006, volume 4117 of Lecture Notes in Computer Science, pages 198–213, Berlin, Heidelberg, August 2006. Springer-Verlag. doi: 10.1007/11818175 12. URL http://dx.doi.org/10.1007/11818175_12. Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, March 2011. URL http://jmlr.csail.mit.edu/papers/v12/chaudhuri11a.html. Kamalika Chaudhuri, Anand D. Sarwate, and Kaushik Sinha. Near-optimal differentially private principal components. In P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 998–1006, 2012. URL http://books.nips.cc/papers/files/nips25/NIPS2012_0482.pdf. Yasuko Chikuse. Statistics on Special Manifolds. Number 174 in Lecture Notes in Statistics. Springer, New York, 2003. John H. Conway and Neil J. A. Sloane. Sphere Packing, Lattices and Groups. Springer-Verlag, New York, 1998. Mary Kathryn Cowles and Bradley P. Carlin. Markov Chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91(434):883, June 1996. ISSN 01621459. doi: 10.2307/2291683. URL http://dx.doi.org/10.2307/2291683. Imre Csisz´ r and Prakash Narayan. The capacity of the arbitrarily varying channel revisited : a Positivity, constraints. IEEE Transactions on Information Theory, 34(2):181–193, 1988. doi: 10.1109/18.2627. URL http://dx.doi.org/10.1109/18.2627. Imre Csisz´ r and Prakash Narayan. Capacity of the Gaussian arbitrarily varying channel. IEEE a Transactions on Information Theory, 37(1):18–26, 1991. doi: 10.1109/18.61125. URL http: //dx.doi.org/10.1109/18.61125. Randal Douc, Eric Moulines, and Jeffrey S. Rosenthal. Quantitative bounds on convergence of timeinhomogeneous Markov chains. The Annals of Applied Probability, 14(4):1643–1665, November 2938 A N EAR -O PTIMAL A LGORITHM FOR D IFFERENTIALLY-P RIVATE P RINCIPAL C OMPONENTS 2004. ISSN 1050-5164. doi: 10.1214/105051604000000620. URL http://dx.doi.org/10. 1214/105051604000000620. Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC ’09), pages 371–380, New York, NY, USA, 2009. ACM. doi: 10.1145/1536414.1536466. URL http://dx.doi.org/10.1145/ 1536414.1536466. Cynthia Dwork and Adam Smith. Differential privacy for statistics: What we know and what we want to learn. Journal of Privacy and Confidentiality, 1(2):135–154, 2009. URL http: //repository.cmu.edu/jpc/vol1/iss2/2. Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Serge Vaudenay, editor, Advances in Cryptology - EUROCRYPT 2006, volume 4004 of Lecture Notes in Computer Science, pages 486–503, Berlin, Heidelberg, 2006a. Springer-Verlag. doi: 10.1007/11761679 29. URL http: //dx.doi.org/10.1007/11761679_29. Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Shai Halevi and Tal Rabin, editors, Theory of Cryptography, volume 3876 of Lecture Notes in Computer Science, pages 265–284, Berlin, Heidelberg, March 4–7 2006b. Springer. doi: 10.1007/11681878 14. URL http://dx.doi.org/10.1007/11681878_14. Salaheddine El Adlouni, Anne-Catherine Favre, and Bernard Bob´ e. Comparison of methodologies e to assess the convergence of Markov chain Monte Carlo methods. Computational Statistics & Data Analysis, 50(10):2685–2701, June 2006. ISSN 01679473. doi: 10.1016/j.csda.2005.04.018. URL http://dx.doi.org/10.1016/j.csda.2005.04.018. Alexandre Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACTSIGART Symposium on Principles of Database Systems (PODS), pages 211–222, 2003. doi: 10.1145/773153.773174. URL http://dx.doi.org/10.1145/773153.773174. Arik Friedman and Assaf Schuster. Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pages 493–502, New York, NY, USA, 2010. ACM. doi: 10.1145/1835804.1835868. URL http://dx.doi.org/10.1145/1835804.1835868. Benjamin C. M. Fung, Ke Wang, Rui Chen, and Philip S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42(4):14:1–14:53, June 2010. doi: 10.1145/1749603.1749605. URL http://dx.doi.org/10.1145/1749603.1749605. Srivatsava Ranjit Ganta, Shiva Prasad Kasiviswanathan, and Adam Smith. Composition attacks and auxiliary information in data privacy. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08), pages 265–273, New York, NY, USA, 2008. ACM. doi: 10.1145/1401890.1401926. URL http://dx.doi.org/10.1145/ 1401890.1401926. 2939 C HAUDHURI , S ARWATE AND S INHA Shuguo Han, Wee Keong Ng, and P.S. Yu. Privacy-preserving singular value decomposition. In Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE), pages 1267 –1270, 2009. doi: 10.1109/ICDE.2009.217. URL http://dx.doi.org/10.1109/ICDE. 2009.217. Moritz Hardt and Aaron Roth. Beating randomized response on incoherent matrices. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC ’12), pages 1255–1268, New York, NY, USA, 2012. ACM. doi: 10.1145/2213977.2214088. URL http://dx.doi.org/ 10.1145/2213977.2214088. Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC ’13), pages 331–340, New York, NY, USA, June 2013. ACM. doi: 10.1145/2488608.2488650. URL http: //dx.doi.org/10.1145/2488608.2488650. Michael Hay, Chao Li, Gerome Miklau, and David Jensen. Accurate estimation of the degree distribution of private networks. In 2009 Ninth IEEE International Conference on Data Mining (ICDM ’09), pages 169–178, 2009. doi: 10.1109/ICDM.2009.11. URL http://dx.doi.org/ 10.1109/ICDM.2009.11. Peter D. Hoff. Simulation of the matrix Bingham-von Mises-Fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics, 18(2): 438–456, 2009. ISSN 1061-8600. doi: 10.1198/jcgs.2009.07177. URL http://dx.doi.org/ 10.1198/jcgs.2009.07177. Galin L. Jones and James P. Hobart. Honest exploration of intractable probability distributions via Markov Chain Monte Carlo. Statistical Science, 16(4):312–334, 2001. doi: 10.1214/ss/ 1015346317. URL http://dx.doi.org/10.1214/ss/1015346317. Galin L. Jones and James P. Hobart. Sufficient burn-in for Gibbs samplers for a hierarchical random effects model. The Annals of Statistics, 32(2):784–817, April 2004. doi: 10.1214/ 009053604000000184. URL http://dx.doi.org/10.1214/009053604000000184. Boˇtjan Kaluˇ a, Violeta Mirchevska, Erik Dovgan, Mitja Luˇtrek, and Matjaˇ Gams. An agents z s z based approach to care in independent living. In B. de Ruyter et al., editor, International Joint Conference on Ambient Intelligence (AmI-10), volume 6439/2010 of Lecture Notes in Computer Science, pages 177–186. Springer-Verlag, Berlin Heidelberg, 2010. doi: 10.1007/ 978-3-642-16917-5 18. URL http://dx.doi.org/10.1007/978-3-642-16917-5_18. Mikhail Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’13), pages 1395–1414, New Orleans, LA, USA, January 2013. Shiva Prasad Kasiviswanathan and Adam Smith. A note on differential privacy: Defining resistance to arbitrary side information. Technical Report arXiv:0803.3946v1 [cs.CR], ArXiV, March 2008. URL http://arxiv.org/abs/0803.3946. Krishnaram Kenthapadi, Aleksandra Korolova, Ilya Mironov, and Nina Mishra. Privacy via the Johnson-Lindenstrauss transform. Journal of Privacy and Confidentiality, 5(1):39–71, 2013. URL http://repository.cmu.edu/jpc/vol5/iss1/2. 2940 A N EAR -O PTIMAL A LGORITHM FOR D IFFERENTIALLY-P RIVATE P RINCIPAL C OMPONENTS John E. Kolassa. Convergence and accuracy of Gibbs sampling for conditional distributions in generalized linear models. The Annals of Statistics, 27(1):129–142, 1999. doi: 10.1214/aos/ 1018031104. URL http://dx.doi.org/10.1214/aos/1018031104. John E. Kolassa. Explicit bounds for geometric covergence of Markov chains. Journal of Applied Probability, 37(3):642–651, 2000. doi: 10.1239/jap/1014842825. URL http://dx.doi.org/ 10.1239/jap/1014842825. Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. Closeness: A new privacy measure for data publishing. IEEE Transactions on Knowledge and Data Engineering, 22(7):943–956, 2010. doi: 10.1109/TKDE.2009.139. URL http://dx.doi.org/10.1109/TKDE.2009.139. Kun Liu, Hillol Kargupta, and Jessica Ryan. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering, 18(1):92–106, 2006. doi: 10.1109/TKDE.2006.14. URL http://dx.doi.org/ 10.1109/TKDE.2006.14. Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakrishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE), page 24, 2006. doi: 10.1109/ICDE.2006.1. URL http://dx.doi.org/10.1109/ICDE.2006.1. Ashwin Machanavajjhala, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. Privacy: Theory meets practice on the map. In IEEE 24th International Conference on Data Engineering (ICDE), pages 277–286, April 2008. doi: 10.1109/ICDE.2008.4497436. URL http://dx.doi.org/10.1109/ICDE.2008.4497436. Frank McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In SIGMOD Conference, pages 19–30, 2009. doi: 10.1145/1559845.1559850. URL http://dx.doi.org/10.1145/1559845.1559850. Frank McSherry and Ilya Mironov. Differentially private recommender systems: Building privacy into the netflix prize contenders. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data (KDD), pages 627–636, 2009. doi: 10.1145/1557019.1557090. URL http://dx.doi.org/10.1145/1557019.1557090. Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS ’07), pages 94–103, October 2007. doi: 10.1109/FOCS.2007.41. URL http://dx.doi.org/10.1109/FOCS.2007.41. Ilya Mironov. On significance of the least significant bits for differential privacy. In Proceedings of the ACM Conference on Computer and Communications Security (CCS ’12), pages 650–661, New York, NY, USA, 2012. ACM. doi: 10.1145/2382196.2382264. URL http://dx.doi.org/ 10.1145/2382196.2382264. Noman Mohammed, Rui Chen, Benjamin C. M. Fung, and Philip S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’11), pages 493–501, New York, NY, USA, 2941 C HAUDHURI , S ARWATE AND S INHA 2011. ACM. doi: 10.1145/2020408.2020487. URL http://dx.doi.org/10.1145/2020408. 2020487. Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (STOC ’07), pages 75–84, New York, NY, USA, 2007. ACM. doi: 10.1145/1250790. 1250803. URL http://dx.doi.org/10.1145/1250790.1250803. Gareth O. Roberts. Bounds on regeneration times and convergence rates for Markov chains. Stochastic Processes and their Applications, 80(2):211–229, April 1999. ISSN 03044149. doi: 10.1016/S0304-4149(98)00085-4. URL http://dx.doi.org/10.1016/S0304-4149(98) 00085-4. Gareth O. Roberts and Sujit K. Sahu. Approximate predetermined convergence properties of the Gibbs sampler. Journal of Computational and Graphical Statistics, 10(2):216–229, June 2001. ISSN 1061-8600. doi: 10.1198/10618600152627915. URL http://dx.doi.org/10.1198/ 10618600152627915. Jeffrey S. Rosenthal. Minorization conditions and convergence rates for Markov Chain Monte Carlo. Journal of the American Statistical Association, 90(430):558–566, June 1995. ISSN 01621459. doi: 10.2307/2291067. URL http://dx.doi.org/10.2307/2291067. Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. Journal of Privacy and Confidentiality, 4(1):65–100, 2012. URL http://repository.cmu.edu/jpc/vol4/iss1/4/. Claude. E. Shannon. Probability of error for optimal codes in a Gaussian channel. Bell System Technical Journal, 38:611–656, 1959. Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC ’11), pages 813–822, New York, NY, USA, 2011. ACM. doi: 10.1145/1993636.1993743. URL http: //dx.doi.org/10.1145/1993636.1993743. Gilbert W. Stewart. On the early history of the singular value decomposition. SIAM Review, 35(4): 551–566, December 1993. Latanya Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, October 2002. doi: 10.1142/S0218488502001648. URL http://dx.doi.org/10.1142/S0218488502001648. Peter van der Putten and Maarten van Someren. CoIL Challenge 2000: The Insurance Company Case, 2000. URL http://www.liacs.nl/˜putten/library/cc2000/. Leiden Institute of Advanced Computer Science Technical Report 2000-09. Larry Wasserman and Shuheng Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375–389, 2010. doi: 10.1198/jasa.2009.tm08651. URL http://dx.doi.org/10.1198/jasa.2009.tm08651. 2942 A N EAR -O PTIMAL A LGORITHM FOR D IFFERENTIALLY-P RIVATE P RINCIPAL C OMPONENTS Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2451–245, 2010. URL http://books.nips.cc/ papers/files/nips23/NIPS2010_1276.pdf. Bin Yu. Assouad, Fano, and Le Cam. In David Pollard, Erik Torgersen, and Grace L. Yang, editors, Festschrift for Lucien Le Cam, Research Papers in Probability and Statistics, chapter 29, pages 423–425. Springer-Verlag, 1997. Justin Z. Zhan and Stan Matwin. Privacy-preserving support vector machine classification. International Journal of Intelligent Information and Database Systems, 1(3/4):356–385, 2007. doi: 10.1504/IJIIDS.2007.016686. Shuheng Zhou, Katrina Ligett, and Larry Wasserman. Differential privacy with compression. In Proceedings of the 2009 International Symposium on Information Theory (ISIT), pages 2718– 2722, Seoul, South Korea, June–July 2009. doi: 10.1109/ISIT.2009.5205863. 2943