The origins of linear time series analysis go back pretty much as far as those of statistics itself*. As soon as someone has a sequence of numbers they care about, "what comes next?" is a pretty natural question. Once the technology of linear least-squares modeling existed, it was equally natural that people would try to use it for time-series forecasting. This tendency was encouraged by some very deep results obtained by Hermann Wold in the 1930s, which, roughly speaking, said that any well-behaved ("weakly stationary") time series, X_t, could be represented either as a linear regression on all its own past values plus noise (an "infinite-order autoregression"),
Nonetheless, Wold's results helped lead to the articulation of an elaborate machinery of finite-order autoregressive and moving-average models, including hybrid autoregressive-moving-average ("ARMA") models, with IID Gaussian noise, where
Most time series textbooks concern themselves exclusively with the ARMA technology**. Fan and Yao cover it in their first four chapters, totaling just under 200 pages. (And some of those pages are general theory of stochastic processes, of broader application.) This seems to me about the upper limit of the attention this material merits for contemporary readers. Chapters 5--10, running about 300 pages, are given over to modern non-parametric methods for time series: estimating the marginal density; estimating conditional expectations; smoothing and removing trends and periodic components; estimating the spectral density or power spectrum; estimating the conditional variance; estimating conditional distributions and prediction intervals; validating predictive models; and testing hypotheses about stochastic processes. Parametric specifications, like the ARMA model, are completely avoided, in favor of estimating the relevant functions directly from data, essentially just by sitting, waiting, and averaging.
To see how this might hope to work, consider just the problem of guessing at the expected value of X_{t+1} given X_t,
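\[
\mathbb{E}\left[X_{t+1}|X_t=x\right] \equiv m(x) ~ .
\]
Say we want m(0), and have a series of length T. The simplest estimate averages the successors of all the points which fell within some distance h, the "bandwidth", of zero:
\[
\widehat{m}_h(0) = \frac{\sum_{t: |x_t| < h}{x_{t+1}}}{\sum_{t: |x_t| < h}{1}} = \frac{1}{N_h}\sum_{t: |x_t| < h}{x_{t+1}}~.
\]
This estimate is subject to two kinds of error: noise and approximation.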
Noise is straightforward: the estimate is based on only a limited number of samples, each of which includes some stochastic fluctuation. For reasonable ("ergodic") time series, if we extend T, then the number of points we average over, N_h, will grow proportionally to T, so the average will converge on an expectation value, and the sampling noise goes away. Under some fairly weak assumptions, the variance of the sampling noise in this estimate goes to zero like 1/(hT), "is O(1/hT)", with the factor of h reflecting the fact that larger bandwidths will lead to averaging over proportionately more samples.
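The local average is only a few lines of code. Here is a minimal sketch in Python with NumPy; the function name `local_average` and the simulated AR(1) test series are my own illustrations, not anything taken from Fan and Yao. It returns both the estimate and N_h, so one can watch the number of averaged points grow with T.

```python
import numpy as np

def local_average(x, x0, h):
    """Average the successors x[t+1] of all points with |x[t] - x0| < h.

    Returns the estimate and N_h, the number of points averaged over.
    """
    x = np.asarray(x, dtype=float)
    past, future = x[:-1], x[1:]
    in_window = np.abs(past - x0) < h
    n_h = int(in_window.sum())
    if n_h == 0:
        # Nothing fell inside the window, so there is nothing to average.
        return float("nan"), 0
    return float(future[in_window].mean()), n_h

# Illustrative data only (an assumption): a stationary linear AR(1),
# for which the true value of m(0) is exactly zero.
rng = np.random.default_rng(0)

def simulate_ar1(T, phi=0.5):
    x = np.zeros(T)
    for t in range(T - 1):
        x[t + 1] = phi * x[t] + rng.normal()
    return x

for T in (1_000, 10_000):
    estimate, n_h = local_average(simulate_ar1(T), x0=0.0, h=0.1)
    print(f"T={T}: N_h={n_h}, estimate of m(0)={estimate:.3f}")
```

With the bandwidth held fixed, the count N_h should come out roughly ten times larger for the longer series, which is the proportionality to T that the variance argument relies on.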
Approximation error is slightly trickier. The expectation value this estimate converges on is
\[
\mathbb{E}\left[\widehat{m}_h(0)\right] =\mathbb{E}\left[X_{t+1}| X_t \in (-h,h)\right] ~,
\]
which is not quite the same as what we want,
\[
m(0) = \mathbb{E}\left[X_{t+1}| X_t = 0\right] ~.
\]
The difference between the two, the bias, shrinks with the bandwidth***:
\[
\mathbb{E}\left[\widehat{m}_h(0)\right] - m(0) = O(h^2) ~.
\]
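To get the total mean-squared error, we need to square the bias and add it to the variance, so the over-all error is
\[
\sigma^2 + O(h^4) + O\left(\frac{1}{hT}\right) ~,
\]
where σ^2 is the variance of the intrinsic noise, which no choice of bandwidth can remove.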
There is an optimal bandwidth which finesses this trade-off between approximation error and instability as well as possible. Elementary calculus tells us that h_0 = O(1/T^{1/5}). Notice that this changes with T: as we get more data, we can shrink the bandwidth, and, averaging over finer and finer scales, reveal more and more of the structure in the prediction function m. In fact, if we knew h_0, or an Oracle told us, our excess mean-squared error, over the noise variance σ^2, would be O(1/T^{4/5}). For an ARMA model, on the other hand, the excess mean-squared error is O(r + 1/T), where the additive constant r reflects the ineradicable systematic error that comes from the fact that the ARMA model is mis-specified. The conventional technology converges faster, but to a systematically wrong answer. This can still be useful, especially if you don't have much data (so that 1/T is noticeably smaller than 1/T^{4/5}), but the local-averaging method ultimately wins.
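The "elementary calculus" can be spelled out. This is a sketch: it writes the squared bias as a h^4 and the variance as b/(hT), where a and b are unspecified positive constants standing in for the O() terms, and R(h) is the excess error. All three symbols are my notation, not the book's.

```latex
\begin{eqnarray*}
R(h) & = & a h^4 + \frac{b}{hT}\\
\frac{dR}{dh} & = & 4 a h^3 - \frac{b}{h^2 T} = 0
  \quad \Rightarrow \quad h_0 = \left(\frac{b}{4aT}\right)^{1/5} = O\left(T^{-1/5}\right)\\
R(h_0) & = & a\left(\frac{b}{4a}\right)^{4/5} T^{-4/5} + b\left(\frac{4a}{b}\right)^{1/5} T^{-4/5} = O\left(T^{-4/5}\right)
\end{eqnarray*}
```

At the optimum the squared bias and the variance shrink at the same rate, so neither term can be made smaller without inflating the other.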
Obviously, there is nothing special about making the prediction at X_t = 0; we could make it in the vicinity of any point we like. (This would mean averaging a different set of example points, depending on where we want to make a forecast.) We don't have to give every point within the averaging window equal weight; we can use smooth "kernel functions" to give more weight to those which are closer to the point where we want to make a prediction. This is called "kernel smoothing". Finally, we don't have to just take a weighted average; since we like linear regression so much, we could do one on just the historical points that fell within the bandwidth region. (The approximation error is then still O(h^2), but the constants are smaller.) This "local linear regression" means solving a different linear least-squares problem for each prediction, but thanks to 200 years of concentrated attention by very smart numerical algorithm designers, linear least squares is fast.
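Both refinements fit in a short sketch. The Gaussian kernel is an assumption chosen for convenience (any smooth, peaked weight function would do), and the function names are hypothetical. The toy data are exactly linear, which makes the difference visible: at the edge of the data the weighted average is pulled toward the interior, while the local linear fit is not.

```python
import numpy as np

def gaussian_kernel(u):
    """Smooth weights that decay with scaled distance u (assumed kernel)."""
    return np.exp(-0.5 * u**2)

def kernel_smooth(past, future, x0, h):
    """Kernel-weighted average of the successors, evaluated at x0."""
    w = gaussian_kernel((past - x0) / h)
    return float(np.sum(w * future) / np.sum(w))

def local_linear(past, future, x0, h):
    """Weighted least-squares line around x0; returns the line's value at x0."""
    w = gaussian_kernel((past - x0) / h)
    # Centre the regressor at x0 so the intercept is the prediction at x0.
    design = np.column_stack([np.ones_like(past), past - x0])
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary lstsq.
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], future * sw, rcond=None)
    return float(coef[0])

# Toy check on exactly linear data, future = 2 * past + 1 (illustrative only).
past = np.linspace(0.0, 1.0, 50)
future = 2.0 * past + 1.0
print("edge, kernel average:", kernel_smooth(past, future, 0.0, 0.2))
print("edge, local linear:  ", local_linear(past, future, 0.0, 0.2))
print("interior, kernel average:", kernel_smooth(past, future, 0.5, 0.2))
```

Each call to `local_linear` solves its own small least-squares problem, which is the "different problem for each prediction" mentioned above.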
I have spoken so far about what we could do if the Oracle told us h_0. What are we to do now that the Oracle has fallen silent (perhaps out of hurt and resentment)? The key trick is to realize that the best bandwidth is defined as the one which allows the best generalization from old data to new data. We can emulate this by dividing the data into parts, fitting our model to one part with a range of bandwidths, and seeing which one does best at predicting the rest of the data. There are a lot of wrinkles to this idea of cross-validation, and I won't here go into the details, or into the alternatives, but the key property is that if we do it right, we select a bandwidth which is very close to what the Oracle would have told us, so close that our over-all error is still O(1/T^{4/5}). In the jargon, non-parametric smoothing "adapts" to the properties of the unknown data-generating process, which lets us discover those properties.
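The crudest version of this idea, a single split into an earlier and a later part, can be sketched as follows. This is only one of the many variants of cross-validation, and not necessarily the one Fan and Yao recommend. The data-generating map, the bandwidth grid, and the function names are all assumptions made for illustration.

```python
import numpy as np

def kernel_predict(train_past, train_future, query, h):
    """Gaussian-kernel weighted average of train_future at each query point."""
    u = (query[:, None] - train_past[None, :]) / h
    w = np.exp(-0.5 * u**2)
    return (w @ train_future) / w.sum(axis=1)

def select_bandwidth(x, bandwidths, train_frac=0.5):
    """Fit on the first part of the series, score each bandwidth on the rest."""
    x = np.asarray(x, dtype=float)
    past, future = x[:-1], x[1:]
    n_train = int(train_frac * len(past))
    errors = []
    for h in bandwidths:
        # If every weight underflows we get 0/0; treat that bandwidth as unusable.
        with np.errstate(invalid="ignore", divide="ignore"):
            pred = kernel_predict(past[:n_train], future[:n_train],
                                  past[n_train:], h)
        err = np.mean((future[n_train:] - pred) ** 2)
        errors.append(err if np.isfinite(err) else np.inf)
    errors = np.array(errors)
    return float(bandwidths[int(np.argmin(errors))]), errors

# Hypothetical nonlinear autoregression, used here only as test data.
rng = np.random.default_rng(1)
T = 2000
x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = 0.9 * np.sin(3.0 * x[t]) + rng.normal(0.0, 0.1)

bandwidths = np.logspace(-2, 0.5, 15)
h_best, cv_errors = select_bandwidth(x, bandwidths)
print("selected bandwidth:", h_best)
print("held-out MSE at that bandwidth:", cv_errors.min())
print("held-out MSE at the widest bandwidth:", cv_errors[-1])
```

Very small bandwidths chase noise and very large ones flatten the curve toward a global mean, so the held-out error should bottom out somewhere inside the grid.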
Let me be a bit more concrete, by running a little simulation and treating it like real data. Here is the beginning of a time series I generated from a nonlinear model (the logistic map, plus Gaussian noise, if you care):
If I make a scatter-plot of successive values against each other, there's pretty clearly a lot of structure:
Linear regression is of course quite unable to find this structure; the red line here is the best-fitting line, which would correspond to an AR(1).
Using more flexible ARMA models doesn't help; for instance, if I go out to an AR(8), where each point is linearly regressed on the eight previous values, my predictions are still basically flat.
On the other hand, a straight-forward nonparametric smoothing picks up the obvious structure (blue points):
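A comparison along these lines can be reproduced in a few dozen lines. This is a sketch under stated assumptions, not the code behind the figures: I take the "plus Gaussian noise" to be observational noise added to a deterministic logistic map with parameter 4, and the noise level, bandwidth, series length, and half-and-half split are all arbitrary choices of mine.

```python
import numpy as np

def simulate_noisy_logistic(T, noise_sd=0.05, seed=0):
    """Logistic map s -> 4 s (1 - s), observed with additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    s = np.empty(T)
    s[0] = 0.3
    for t in range(T - 1):
        s[t + 1] = 4.0 * s[t] * (1.0 - s[t])
    return s + rng.normal(0.0, noise_sd, size=T)

def lagged(x, p):
    """Rows (x[t-1], ..., x[t-p]) and the matching targets x[t]."""
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    return X, x[p:]

def ar_mse(train, test, p):
    """Fit an AR(p) with intercept by least squares; return held-out MSE."""
    X_tr, y_tr = lagged(train, p)
    X_te, y_te = lagged(test, p)
    A = np.column_stack([np.ones(len(y_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    pred = np.column_stack([np.ones(len(y_te)), X_te]) @ coef
    return float(np.mean((y_te - pred) ** 2))

def kernel_mse(train, test, h):
    """Gaussian-kernel smoother of x[t] on x[t-1]; return held-out MSE."""
    X_tr, y_tr = lagged(train, 1)
    X_te, y_te = lagged(test, 1)
    w = np.exp(-0.5 * ((X_te - X_tr.T) / h) ** 2)
    pred = (w @ y_tr) / w.sum(axis=1)
    return float(np.mean((y_te - pred) ** 2))

x = simulate_noisy_logistic(2000)
train, test = x[:1000], x[1000:]
mse_ar1 = ar_mse(train, test, 1)
mse_ar8 = ar_mse(train, test, 8)
mse_kernel = kernel_mse(train, test, h=0.05)
print("AR(1) held-out MSE: ", mse_ar1)
print("AR(8) held-out MSE: ", mse_ar8)
print("kernel held-out MSE:", mse_kernel)
```

The successive values of this map are essentially uncorrelated, so both autoregressions should do little better than predicting the mean, while the smoother recovers most of the parabola.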
The foregoing was a deliberate caricature: just enough of the key features, exaggerated, to be recognizable, with all the other details suppressed. Fan and Yao do not deal in caricatures, and give a rather thorough account, on the statistical side, of kernel smoothing and local linear regression; of the alternative smoothing technology called "splines"; and of the difficulties all of these can run into when using not just one but many predictor variables, and the solutions, like additive models. (Their other main statistical tools are the bootstrap and cross-validation, about which they say less.) On the stochastic side, they rely on central limit theorems under mixing conditions for the convergences which I blithely assumed above. Their results are either asymptotic or empirical; there is no attempt at finite-sample performance guarantees. Many proofs are only sketched, or referred in their entirety to papers. The emphasis is consistently on statistical methodology, rather than on statistical theory****.
In principle, this could be read by someone with no previous exposure to either time series analysis or to non-parametric statistics, but with a basic knowledge of probability theory, linear regression and statistical inference. In practice, it should be fairly straightforward (though not effortless) reading for someone ignorant of the usual time series technology but acquainted with non-parametrics (at the level of, say, Simonoff's Smoothing Methods in Statistics, or Wasserman's All of Nonparametric Statistics). Going the other way, and teaching non-parametrics to an ARMA-mechanic, would I suspect be harder going. There are no problem sets, but there are lots of examples, many with economic or financial content, and reproducing them, and filling in the steps of the proofs, would provide plenty of work for either a class or self-study.
Naturally, there are things I wish it did more of: there is nothing on state-space or hidden-Markov models (let alone state-space reconstruction), and little on fitting or testing stochastic models based on actual scientific theories of the data-generating mechanism*****. Non-parametric estimation of conditional distributions also gets less attention than I feel it deserves. On the computational side, let's just say that this book was finished at the end of 2002, and it shows. Still, it's at least as good as anything I've seen on statistical methods for nonlinear time series which has been published since. A second edition would, however, be very welcome. In the meanwhile, if we are going to continue to rely on time series models, this is a good place to start.
*: Klein's Statistical Visions in Time: A History of Time Series Analysis, 1662--1938 is a surprisingly readable history of these developments.
**: To my mind, the two best of these textbooks are Shumway and Stoffer (more elementary) and Brockwell and Davis (more theoretical). (Disclaimer: I know Stoffer, and Brockwell's son is a colleague.)
***: Re-write our local averaging estimate in terms of the function we are trying to find and the stochastic noise which disturbs observations away from it:

\begin{eqnarray*}
\mathbb{E}\left[\widehat{m}_h(0) \right] & = & \mathbb{E}\left[m(X_t)| X_t \in (-h,h) \right]\\
& = & \mathbb{E}\left[m(0) + X_t m^{\prime}(0) + \frac{X_t^2}{2}m^{\prime\prime}(0) + o(h^2)| X_t \in (-h,h)\right]\\
& = & m(0) + m^{\prime}(0) \mathbb{E}\left[X_t|X_t \in (-h,h)\right] + \frac{m^{\prime\prime}(0)}{2}\mathbb{E}\left[X_t^2|X_t \in (-h,h)\right] + o(h^2)
\end{eqnarray*}
Naively, then, one would think the systematic error of approximating m(0) in this manner would be O(m'(0)h). Reflect, however, that if the density of X_t is symmetric around zero, then there would be no net contribution from m'(0): data points to the right of zero, introducing an error of order h m'(0), will be canceled out by data points to the left of zero, introducing an error of order -h m'(0). To get a net error contribution from the linear term, we need the samples to be biased to one side of 0 or the other, and this bias will be proportional to h f'(0), where f is the density of X_t, for a total contribution to the error of O(h^2 m'(0) f'(0)). The quadratic term yields a straight-forward O(h^2 m''(0)) contribution, and one can verify (somewhat tediously) that pushing the Taylor expansion of m and f to higher orders only yields comparatively negligible, o(h^2), terms.
****: Fan and Yao note in passing that splines and additive models date back to the 1920s, but were simply not practical with that era's computing technology. (Kernel smoothing goes back to at least the 1950s.) The continued preference of non-statisticians for linear models over additive models, in particular, seems to have no basis other than ignorance and historical inertia.
If you really want a modern treatment of statistical theory for time series, and I confess to a possibly-morbid taste for the stuff, I strongly recommend Masanobu Taniguchi and Yoshihide Kakizawa, Asymptotic Theory of Statistical Inference for Time Series, and Bosq and Blanke, Inference and Prediction in Large Dimensions.
*****: Cf. this paper by Reilly and Zeringue, recently highlighted by Andrew Gelman. For that matter, this is the approach advocated by Guttorp in his great Stochastic Modeling of Scientific Data.
Hardback, ISBN 978-0-387-95170-6; paperback, ISBN 978-0-387-26142-3