Model Selection
29 Apr 2008 13:16
(Reader, please make your own suitably awful pun about the different senses of "model selection" here, as a discouragement to those finding this page through prurient searching. Thank you.)
In statistics and machine learning, "model selection" is the problem of picking among different mathematical models which all purport to describe the same data set. This notebook will not (for now) give advice on it; as usual, it's more of a place to organize my thoughts and references...
Classification of approaches to model selection (probably not really exhaustive but I can't think of others, right now):
- Direct optimization of some measure of goodness of fit or risk on training data.
- Seems implicit in a lot of work which points to marginal improvements in "the proportion of variance explained", mis-classification rates, "perplexity", etc. Often, also, a recipe for over-fitting and chasing snarks. What's wanted is (almost always) some way of measuring the ability to generalize to new data, and in-sample performance is a biased estimate of this. Still, with enough data, if the gods of ergodicity are kind, in-sample performance is representative of generalization performance, so perhaps this will work asymptotically, though in many cases the researcher will never even glimpse Asymptopia across the Jordan.
- Optimize fit with model-dependent penalty
- Add on a term to each model which supposedly indicates its ability to over-fit. (Adjusted R^2, AIC, BIC, ..., all do this in terms of the number of parameters.) Sounds reasonable, but I wonder how many actually work better, in practice, than direct optimization. (See Domingos for some depressing evidence on this score.)
- Classical two-part minimum description length methods were penalties; I don't yet understand one-part MDL.
- Penalties which depend on the model class
- Measure the capacity of a class of models to over-fit; penalize all models in that class accordingly, regardless of their individual properties. Outstanding example: Vapnik's "structural risk minimization" (provably consistent under some circumstances). Only sporadically coincides with AIC/BIC/etc. type penalties based on the number of parameters.
- Cross-validation
- Estimate the ability to generalize to different data by, in fact, using different data. Maybe the "industry standard" of machine learning. Query, how are we to know how much different data to use?
- The method of sieves
- Directly optimize the fit, but within a constrained class of models; relax the constraint as the amount of data grows. If the constraint is relaxed slowly enough, should converge on the truth. (Ordinary parametric inference, within a single model class, is a limiting case where the constraint is relaxed infinitely slowly, and we converge on the pseudo-truth within that class [provided we have a consistent estimator].)
- Encompassing models
- Come up with a single model class which includes all the interesting model classes as special cases; do ordinary estimation within it. Getting a consistent estimator of the additional parameters this introduces is often non-trivial, and interpretability can be a problem.
- Model averaging
- Don't try to pick the best or correct model; use them all with different weights. Choose the weighting scheme so that if one is best, it will tend to be more and more influential. Often I think the improvement is not so much from using multiple models as from smoothing, since estimates of the single best model are going to be more noisy than estimates of a bunch of models which are all very good.
- Adequacy testing
- The correct model should be able to encode the data as uniform IID noise. Test whether "residuals", in the appropriate sense, are IID uniform. Reject models which can't hack it. Possibly none of the models on offer is adequate; this, too, is informative. Or: models make specific probabilistic assumptions (IID Gaussian noise, for example); test those. Mis-specification testing.
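To make the contrast among the first few approaches concrete, here is a minimal sketch, not a recommendation. It assumes a toy problem that is not from anything above: polynomial regression on noisy samples of a sine curve, with Gaussian noise, where the "models" are polynomial degrees. All function names (`make_data`, `fit_mse`, `gaussian_ic`, `compare`, `best_by_criterion`) are hypothetical, and the AIC/BIC formulas used are the standard Gaussian-likelihood forms with additive constants dropped.

```python
import numpy as np

def make_data(n, rng):
    # Toy data-generating process (an assumption for illustration only).
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

def fit_mse(x_train, y_train, x_eval, y_eval, degree):
    # Least-squares polynomial fit on the training points,
    # mean squared error on the evaluation points.
    coeffs = np.polyfit(x_train, y_train, degree)
    resid = y_eval - np.polyval(coeffs, x_eval)
    return float(np.mean(resid ** 2))

def gaussian_ic(mse, n, n_params):
    # For Gaussian noise with ML-estimated variance,
    # -2 * (max log-likelihood) = n * log(mse) + constant.
    neg2loglik = n * np.log(mse)
    aic = neg2loglik + 2 * n_params
    bic = neg2loglik + np.log(n) * n_params
    return float(aic), float(bic)

def compare(n=60, max_degree=10, k=5, seed=0):
    rng = np.random.default_rng(seed)
    x, y = make_data(n, rng)
    # One fixed random partition into k folds, shared by every degree.
    folds = np.array_split(rng.permutation(n), k)
    rows = []
    for d in range(max_degree + 1):
        train = fit_mse(x, y, x, y, d)           # in-sample fit
        aic, bic = gaussian_ic(train, n, d + 2)  # d+1 coefficients + variance
        cv_errs = []
        for fold in folds:
            mask = np.ones(n, dtype=bool)
            mask[fold] = False                   # hold this fold out
            cv_errs.append(fit_mse(x[mask], y[mask], x[fold], y[fold], d))
        rows.append((d, train, aic, bic, float(np.mean(cv_errs))))
    return rows

def best_by_criterion(rows):
    # Degree minimizing each criterion (columns 1..4 of each row).
    names = ["in-sample MSE", "AIC", "BIC", "k-fold CV MSE"]
    table = np.array(rows)
    return {name: int(table[np.argmin(table[:, j + 1]), 0])
            for j, name in enumerate(names)}

if __name__ == "__main__":
    for name, degree in best_by_criterion(compare()).items():
        print(f"{name}: degree {degree}")
```

Because the polynomial classes are nested, in-sample error can only go down as the degree goes up, so direct optimization always favors the largest model on offer; the penalized and cross-validated criteria need not. Which of them does best on any particular problem is exactly the open question raised above, and this toy example settles nothing about it.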
The machine-learning-ish literature on model selection doesn't seem to ever talk about setting up experiments to select among models; or do I just not read the right papers there? (The statistical literature on experimental design tends to talk about "model discrimination" rather than "model selection".)
- Recommended:
- A. C. Atkinson and A. N. Donev, Optimum Experimental Design [Review]
- Leo Breiman, "Heuristics of Instability and Stabilization in Model Selection," Annals of Statistics 24 (1996): 2350--2383 [JSTOR]
- Pedro Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery, 3 (1999) [Online]
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Aris Spanos, "Curve-Fitting, the Reliability of Inductive Inference and the Error-Statistical Approach" [PDF preprint]
- Sara van de Geer, Empirical Process Theory in M-Estimation
- V. N. (=Vladimir Naumovich) Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
- Quang H. Vuong, "Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses", Econometrica 57 (1989): 307--333
- To read:
- Maria Maddalena Barbieri and James O. Berger, "Optimal Predictive Model Selection", math.ST/0406464 = Annals of Statistics 32 (2004): 870--897 [Unfortunately, Bayesian]
- Lucien Birge
- "The Brouwer Lecture 2005: Statistical estimation with model selection", math.ST/0605187
- "Model selection for Poisson processes", math/0609549
- Lucien Birge and Pascal Massart, "Minimal Penalties for Gaussian Model Selection", Probability Theory and Related Fields 138 (2007): 33--73
- Borowiak, Model Discrimination for Nonlinear Regression Models
- Kenneth P. Burnham and David R. Anderson, Model Selection and Inference: A Practical Information-Theoretic Approach
- A. E. Clark and C. G. Troskie, "Time Series and Model Selection", Communications in Statistics: Simulation and Computation 37 (2008): 766--771 [Simulation study of the accuracy of different information criteria]
- Kevin A. Clarke, "A Simple Distribution-Free Test for Nonnested Hypotheses" [PDF preprint]
- Guilhem Coq, Olivier Alata, Marc Arnaudon and Christian Olivier, "An improved method for model selection based on Information Criteria", math.ST/0702540
- Pedro Domingos
- Magalie Fromont, "Model selection by bootstrap penalization for classification", Machine Learning 66 (2007): 165--207
- Christophe Giraud, "Estimation of Gaussian graphs by model selection", arxiv:0710.2044
- Christian Gourieroux and Alain Monfort, "Testing, Encompassing, and Simulating Dynamic Econometric Models", Econometric Theory 11 (1995): 195--228 [JSTOR]
- Marcus Hutter, "The Loss Rank Principle for Model Selection", math.ST/0702804 [This sounds a bit like severity]
- Michael Kearns and Dana Ron, "Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation," Neural Computation 11 (1999): 1427--1453
- Nicholas M. Kiefer and Hwan-Sik Choi, "Robust Model Selection in Dynamic Models with an Application to Comparing Predictive Accuracy" [SSRN]
- Sadanori Konishi and Genshiro Kitagawa, "Asymptotic theory for information criteria in model selection --- functional approach," Journal of Statistical Planning and Inference 114 (2003): 45--61
- Hannes Leeb and Benedikt M. Poetscher, "Can One Estimate The Unconditional Distribution of Post-Model-Selection Estimators?", arxiv:0704.1584 [They claim the answer is "No".]
- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory 50 (2004): 2708--2726
- Abraham Meidan and Boris Levin, "Choosing from Competing Theories in Computerised Learning", Minds and Machines 12 (2002): 119--129
- Grayham E. Mizon and Massimiliano Marcellino (eds.), Progressive Modelling: Non-nested Testing and Encompassing [Blurb, table of contents]
- Ali Mohammad-Djafari, "Model selection for inverse problems: Best choice of basis functions and model order selection," physics/0111020
- Pradeep Ravikumar, Martin J. Wainwright, John D. Lafferty, "High-Dimensional Graphical Model Selection Using $\ell_1$-Regularized Logistic Regression", arxiv:0804.4202
- Douglas Rivers and Quang H. Vuong, "Model selection tests for nonlinear dynamic models", The Econometrics Journal 5 (2002): 1--39
- Aris Spanos
- "Statistical Induction, Severe Testing, and Model Validation" [Preprint]
