Cross-Validation
21 Dec 2011 10:11
One of the most brilliantly simple and compelling ideas in all of statistics: to estimate how well your model will do on new data, take your data set and divide it into two parts at random. Fit the model to one part and then evaluate its prediction on the other; average over a couple of splits into training and testing sets.
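A minimal sketch of the recipe just described, assuming squared-error loss and polynomial regression as the family of models; the function names (`random_split_cv`, `poly_fit`, `poly_predict`), the simulated data, and the choice of ten half-and-half splits are all illustrative assumptions, not anything prescribed above.

```python
import numpy as np

def random_split_cv(x, y, fit, predict, n_splits=10, test_fraction=0.5, seed=0):
    """Estimate out-of-sample mean squared error by repeated random splits.

    `fit(x, y)` returns a fitted model; `predict(model, x)` returns predictions.
    Both are hypothetical callables supplied by the user.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_fraction * n)))
    errors = []
    for _ in range(n_splits):
        # Shuffle the indices, then cut them into a testing and a training part.
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        # Fit on the training part only, evaluate on the held-out part only.
        model = fit(x[train], y[train])
        residuals = y[test] - predict(model, x[test])
        errors.append(np.mean(residuals ** 2))
    # Average the held-out error over all the splits.
    return float(np.mean(errors))

def poly_fit(degree):
    """Return a fitting function for least-squares polynomials of a given degree."""
    return lambda x, y: np.polyfit(x, y, degree)

def poly_predict(coefs, x):
    return np.polyval(coefs, x)

# Simulated data: a quadratic trend plus Gaussian noise (an assumption for the demo).
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=200)

# Cross-validated error for a few candidate degrees.
cv_mse = {d: random_split_cv(x, y, poly_fit(d), poly_predict) for d in (1, 2, 5)}
```

Using `cv_mse` to pick a degree is model selection; reading off the value at the chosen degree as an estimate of generalization error is the related but not identical use.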
As a method of model selection; as (not quite the same thing) a means of estimating the generalization error of a statistical model; relations to bootstrapping. How best to cross-validate time series? Spatial models? Networks? Other kinds of structured data? Relation to "stability" in learning theory.
- Recommended:
- Sylvain Arlot
- "V-fold cross-validation improved: V-fold penalization", arxiv:0802.0566 [Seeing cross-validation as a penalization method, and improving it accordingly by strengthening the penalty term]
- "Model selection by resampling penalization", arxiv:0906.3124 = Electronic Journal of Statistics 3 (2009): 557--624
- Sylvain Arlot and Alain Celisse, "Segmentation of the mean of heteroscedastic data via cross-validation", Statistics and Computing 21 (2011): 613--632, arxiv:0902.3977 [MATLAB code]
- Prabir Burman, Edmond Chow and Deborah Nolan, "A cross-validatory method for dependent data", Biometrika 81 (1994): 351--358 [JSTOR]
- Patrick S. Carmack, William R. Schucany, Jeffrey S. Spence, Richard F. Gunst, Qihua Lin and Robert W. Haley, "Far Casting Cross Validation" [Leave-one-out CV, with a constant-radius window skipped around each hold-out point as well; this is designed to deal with correlations in time or in space. PDF preprint]
- Matthieu Cornec, "Concentration inequalities of the cross-validation estimator for Empirical Risk Minimiser", arxiv:1011.0096
- Charles Mitchell and Sara van de Geer, "General Oracle Inequalities for Model Selection", Electronic Journal of Statistics 3 (2009): 176--204 [Analyzes a data-set splitting scheme (like cross-validation with only one "fold")]
- Jeffrey S. Racine
- "Feasible Cross-Validatory Model Selection for General Stationary Processes", Journal of Applied Econometrics 12 (1997): 169--179 [JSTOR. This is closely related to (maybe algebraically just a special case of?) the familiar trick from splines of writing the CV criterion in terms of the hat/influence/projection matrix.]
- "Consistent cross-validatory model-selection for dependent data: hv-block cross-validation", Journal of Econometrics 99 (2000): 39--61
- Ryan J. Tibshirani and Robert Tibshirani, "A bias correction for the minimum error rate in cross-validation", Annals of Applied Statistics 3 (2009): 822--829 = arxiv:0908.2904
- Mark J. van der Laan and Sandrine Dudoit, "Unified Cross-Validation Methodology for Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples" [PDF working paper, i.e., a 100-page tome. The first part proves that multi-fold cross-validation and the like will work for selecting the best estimator out of a finite set of estimators (provided the loss function is nicely bounded and the data are IID). The second part ingeniously turns this into a complete estimation procedure, by effectively creating a discrete sieve and then using CV to say which part of the sieve to use. This is a very cool set of results, but (1) the limitations to bounded loss functions make me nervous, and (2) the formulas appearing in the finite-sample and even asymptotic bounds are ugly. On the other hand, they have finite-sample bounds! — I wonder if the bounded-and-IID restrictions could be lifted using the techniques in Jiang's "On Uniform Deviation Bounds" (link and description under Learning Theory), or those in Dedecker et al.'s Weak Dependence.]
- Aad W. van der Vaart, Sandrine Dudoit and Mark J. van der Laan, "Oracle inequalities for multi-fold cross validation", Statistics and Decisions 24 (2006): 351--371 [Streamlined and improved versions of the key results from the van der Laan/Dudoit tome. Thanks to Prof. van der Vaart for a reprint]
- To read:
- Sylvain Arlot and Alain Celisse, "A survey of cross-validation procedures for model selection", Statistics Surveys 4 (2010): 40--79
- Yoshua Bengio and Yves Grandvalet, "No unbiased estimator of the variance of k-fold cross-validation", Journal of Machine Learning Research 5 (2004): 1089--1105
- P. Burman, "A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods", Biometrika 76 (1989): 503--514
- Alain Celisse, "Model selection in density estimation via cross-validation", arxiv:0811.0802
- Matthieu Cornec, "Estimating Subbagging by cross-validation", arxiv:1011.5142
- Sandrine Dudoit and Mark J. van der Laan, "Asymptotics of Cross-Validated Risk Estimation in Estimator Selection and Performance Assessment", Statistical Methodology 2 (2005): 131--154 [preprint]
- Jenny Häggström and Xavier de Luna, "Estimating Prediction Error: Cross-Validation vs. accumulated Prediction Error", Communications in Statistics: Simulation and Computation 39 (2010): 880--898
- Satyen Kale, Ravi Kumar and Sergei Vassilvitskii, "Cross-Validation and Mean-Square Stability" [PDF preprint via Dr. Kale]
- Michael Kearns and Dana Ron, "Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation," Neural Computation 11 (1999): 1427--1453
- Art B. Owen and Patrick O. Perry, "Bi-cross-validation of the SVD and the nonnegative matrix factorization", Annals of Applied Statistics 3 (2009): 564--594, arxiv:0908.2062
- M. Pavlic and M. J. van der Laan, "Fitting of mixtures with unspecified number of components using cross validation distance estimate", Computational Statistics and Data Analysis 41 (2003): 413--428
- Juan Diego Rodriguez, Aritz Perez and Jose Antonio Lozano, "Sensitivity analysis of k-fold cross validation in prediction error estimation", IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010): 569--575
- Olga Y. Savchuk, Jeffrey D. Hart, and Simon J. Sheather, "Indirect Cross-Validation for Density Estimation", Journal of the American Statistical Association 105 (2010): 415--423
- Hui Shen, William J. Welch, and Jacqueline M. Hughes-Oliver, "Efficient, adaptive cross-validation for tuning and comparing models, with application to drug discovery", Annals of Applied Statistics 5 (2011): 2668--2687
- David Shilane, Richard H. Liang and Sandrine Dudoit, "Loss-Based Estimation with Evolutionary Algorithms and Cross-Validation", UC Berkeley Biostatistics Working Paper 227 [Abstract, PDF]
- Ansgar Steland, "Sequential Data-Adaptive Bandwidth Selection by Cross-Validation for Nonparametric Prediction", arxiv:1010.6202
- Junhui Wang, "Consistent selection of the number of clusters via crossvalidation", Biometrika 97 (2010): 893--904
- Xiaogang Wang and James V. Zidek, "Selecting likelihood weights by cross-validation", math.ST/0505599 = Annals of Statistics 33 (2005): 463--500 ["weighted likelihood was introduced to formally embrace a variety of statistical procedures that trade bias for precision. Unlike its classical counterpart, the weighted likelihood combines all relevant information while inheriting many of its desirable features including good asymptotic properties. ... weights ... need to be judiciously chosen. ... use of cross-validation ... resulting weighted likelihood estimator (WLE) ... weakly consistent and asymptotically normal. ... application to disease mapping ..."]
- To write:
- CRS, "Cross-validation for mixing processes" [using some notions from learning with dependent data]
- CRS + co-conspirators, "Cross-validation for networks"
