
225 nips-2013-One-shot learning and big data with n=2


Source: pdf

Author: Lee H. Dicker, Dean P. Foster

Abstract: We model a “one-shot learning” situation, where very few observations y1, ..., yn ∈ R are available. Associated with each observation yi is a very high-dimensional vector xi ∈ R^d, which provides context for yi and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses.
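The abstract describes the estimator only at a high level. The Python sketch below illustrates the general idea of principal component regression with a scalar expansion factor c: project the high-dimensional contexts onto their leading principal components, regress the responses on the resulting scores, and multiply the classical PCR prediction by c (c > 1 expands it, c < 1 shrinks it). This is a minimal illustration on assumed synthetic data; the function pcr_predict, the data-generating process, and the particular value of c are illustrative choices, not the expansion constant derived in the paper.

```python
import numpy as np

def pcr_predict(X, y, x_new, c=1.0, n_components=1):
    """Principal component regression with an optional expansion factor c.

    Projects the training contexts X (n x d) onto their top principal
    components, regresses y on the resulting scores, and predicts the
    response for x_new. Multiplying the classical PCR prediction by a
    scalar c > 1 "expands" it, in contrast to shrinkage (c < 1).
    """
    # Right singular vectors of the (uncentered) training contexts give the
    # projection directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:n_components].T                 # d x k projection directions
    scores = X @ V                          # n x k component scores
    # Least-squares fit of y on the component scores.
    beta, *_ = np.linalg.lstsq(scores, y, rcond=None)
    # Classical PCR prediction, expanded (or shrunk) by the scalar c.
    return c * (x_new @ V @ beta)

# Tiny synthetic example in the spirit of the paper: n small, d large.
rng = np.random.default_rng(0)
n, d = 2, 10_000
u = rng.normal(size=d) / np.sqrt(d)         # latent signal direction
X = np.outer(rng.normal(size=n), u) + rng.normal(scale=0.1, size=(n, d))
y = X @ u + rng.normal(scale=0.1, size=n)
x_new = rng.normal() * u + rng.normal(scale=0.1, size=d)

print(pcr_predict(X, y, x_new, c=1.0))      # classical PCR prediction
print(pcr_predict(X, y, x_new, c=1.2))      # expanded prediction (c chosen arbitrarily)
```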


reference text

[1] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:594–611, 2006.

[2] R. Salakhutdinov, J.B. Tenenbaum, and A. Torralba. One-shot learning with a hierarchical nonparametric Bayesian model. JMLR Workshop and Conference Proceedings: Unsupervised and Transfer Learning Workshop, 27:195–206, 2012.

[3] M.C. Frank, N.D. Goodman, and J.B. Tenenbaum. A Bayesian framework for cross-situational word-learning. Advances in Neural Information Processing Systems, 20:20–29, 2007.

[4] J.B. Tenenbaum, T.L. Griffiths, and C. Kemp. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10:309–318, 2006.

[5] C. Kemp, A. Perfors, and J.B. Tenenbaum. Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10:307–321, 2007.

[6] S. Carey and E. Bartlett. Acquiring a single new word. Proceedings of the Stanford Child Language Conference, 15:17–29, 1978.

[7] L.B. Smith, S.S. Jones, B. Landau, L. Gershkoff-Stowe, and L. Samuelson. Object name learning provides on-the-job training for attention. Psychological Science, 13:13–19, 2002.

[8] F. Xu and J.B. Tenenbaum. Word learning as Bayesian inference. Psychological Review, 114:245–272, 2007.

[9] M. Fink. Object classification from a single example utilizing class relevance metrics. Advances in Neural Information Processing Systems, 17:449–456, 2005.

[10] P. Hall, J.S. Marron, and A. Neeman. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:427–444, 2005.

[11] P. Hall, Y. Pittelkow, and M. Ghosh. Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70:159–173, 2008.

[12] Y.I. Ingster, C. Pouet, and A.B. Tsybakov. Classification of sparse high-dimensional vectors. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367:4427–4448, 2009.

[13] W.F. Massy. Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 60:234–256, 1965.

[14] I.M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29:295–327, 2001.

[15] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17:1617–1642, 2007.

[16] B. Nadler. Finite sample approximation results for principal component analysis: A matrix perturbation approach. Annals of Statistics, 36:2791–2817, 2008.

[17] I.M. Johnstone and A.Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104:682–693, 2009.

[18] S. Jung and J.S. Marron. PCA consistency in high dimension, low sample size context. Annals of Statistics, 37:4104–4130, 2009.

[19] S. Lee, F. Zou, and F.A. Wright. Convergence and prediction of principal component scores in high-dimensional settings. Annals of Statistics, 38:3605–3629, 2010.

[20] Q. Berthet and P. Rigollet. Optimal detection of sparse principal components in high dimension. arXiv preprint arXiv:1202.5070, 2012.

[21] S. Jung, A. Sen, and J.S. Marron. Boundary behavior in high dimension, low sample size asymptotics of PCA. Journal of Multivariate Analysis, 109:190–203, 2012.

[22] Z. Ma. Sparse principal component analysis and iterative thresholding. Annals of Statistics, 41:772–801, 2013.

[23] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206, 1955.

[24] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58:267–288, 1996.

[25] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.

[26] V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[27] K.S. Azoury and M.K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211–246, 2001.