Attention conservation notice: Intellectuals gathering in Berkeley to argue about "knowledge" and "revolution".
This looks like fun, and if I didn't have conflicting obligations I'd definitely be there.
From Data to Knowledge: Machine-Learning with Real-time & Streaming Applications
May 7-11 2012
On the Campus of the University of California, BerkeleyWe are experiencing a revolution in the capacity to quickly collect and transport large amounts of data. Not only has this revolution changed the means by which we store and access this data, but has also caused a fundamental transformation in the methods and algorithms that we use to extract knowledge from data. In scientific fields as diverse as climatology, medical science, astrophysics, particle physics, computer vision, and computational finance, massive streaming data sets have sparked innovation in methodologies for knowledge discovery in data streams. Cutting-edge methodology for streaming data has come from a number of diverse directions, from on-line learning, randomized linear algebra and approximate methods, to distributed optimization methodology for cloud computing, to multi-class classification problems in the presence of noisy and spurious data.
This conference will bring together researchers from applied mathematics and several diverse scientific fields to discuss the current state of the art and open research questions in streaming data and real-time machine learning. The conference will be domain driven, with talks focusing on well-defined areas of application and describing the techniques and algorithms necessary to address the current and future challenges in the field.
Sessions will be accessible to a broad audience and will have a single track format with additional rooms for breakout sessions and posters. There will be no formal conference proceedings, but conference applicants are encouraged to submit an abstract and present a talk and/or poster.
See the conference page for submission details, schedules, etc.
Via conference organizer and CMU alumnus Joey Richards.
Posted by crshalizi at February 19, 2012 12:44 | permanent link
Attention conservation notice: Only of interest if you (1) like hearing people talk about statistics and machine learning, and (2) will be in Pittsburgh next week.
I have been remiss about advertising upcoming talks.
As always, the talks are free and open to the public.
(You see why I have trouble keeping up with these.)
Posted by crshalizi at February 19, 2012 12:30 | permanent link
In which extinct charismatic megafauna give us an excuse to practice basic programming, bootstrapping, and specification testing.
Posted by crshalizi at February 15, 2012 14:15 | permanent link
Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using bootstrapping from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
Reading: Notes, chapter 10
Posted by crshalizi at February 15, 2012 14:10 | permanent link
A change to the lecture schedule, by popular demand!
R programs are built around functions: pieces of code that take inputs or arguments, do calculations on them, and give back outputs or return values. The most basic use of a function is to encapsulate something we've done in the terminal, so we can repeat it, or make it more flexible. To assure ourselves that the function does what we want it to do, we subject it to sanity-checks, or "write tests". To make functions more flexible, we use control structures, so that the calculation done, and not just the result, depends on the argument. R functions can call other functions; this lets us break complex problems into simpler steps, passing partial results between functions. Programs inevitably have bugs: debugging is the cycle of figuring out what the bug is, finding where it is in your code, and fixing it. Good programming habits make debugging easier, as do some tricks. Avoiding iteration. Re-writing code to avoid mistakes and confusion, to be clearer, and to be more flexible.
Reading: Notes, chapter 9
Optional reading: Slides from 36-350, introduction to statistical computing, especially through lecture 15.
R for in-class demos (based around the previous problem set)
Posted by crshalizi at February 15, 2012 14:05 | permanent link
Attention conservation notice: Academics with blogs quibbling about obscure corners of applied statistics.
Lurkers in e-mail point me to this pushback against the general pushback against power laws, and ask me to comment. It might be a mistake to do so, but I'm feeling under the weather and so splenetic, so I will.
In our paper, we looked at 24 quantities which people claimed showed power law distributions. Of these, there were seven cases where we could flat-out reject a power law, without even having to consider an alternative, because the departures of the actual distribution from even the best-fitting power law was much too large to be explained away as fluctuations. (One of the wonderful thing about a stochastic model is that it tells you how big its own errors should be.) In contrast, there was only one data set where we could rule out the log-normal distribution.
In some of those cases, you can patch things up, sort of, by replacing a pure power law with a power-law with an exponential cut-off. That is, rather than the probability density being proportional to x-a, it's proportional to x-ae-x/L. (Either way, I am only talking about the probability density in the "right tail", i.e., for x above some xmin.) This gives the infamous straight-ish patch on a log-log plot, for values of x much smaller than L, but otherwise it has substantially different properties. In ten of the twelve cases we looked at, the only way to save the idea of a power-law at all is to include this exponential cut-off. But that exponentially-shrinking factor is precisely what squelches the WTF, X IS ELEVENTY TIMES LARGER THAN EVER! THE BIG ONE IS IN OUR BASE KILLING OUR DOODZ!!!!1!! mega-events. There were ten more cases where we judged the support for power laws as "moderate", meaning "the power law is a good fit but that there are other plausible alternatives as well" (pardon the self-quotation.) Again, those alternatives, like log-normals and stretched exponentials, give very different tail-behavior, with not so much OMG DOOM.
We found exactly one case where the statistical evidence for the power-law was "good", meaning that "the power law is a good fit and that none of the alternatives considered is plausible", which was Zipf's law of word frequency distributions. We were of course aware that when people claim there are power laws, they usually only mean that the tail follows a power law. This is why all these comparisons were about how well the different distributions fit the tail, excluding the body of the data. We even selected where "the tail" begins to maximize the fit to a power law for each case. Even so, there was just this one case where the data compelling support a power law tail.
(All of this — the meaning of "with cut-off", the meaning of our categorizations, the fact that we only compare the tails, etc. — is clear enough from our paper, if you actually read the text. Or even just the tables and their captions.)
I bring up the OMG DOOM because some people, Hanson very much included, like to extrapolate from supposed power laws for various Bad Things to scenarios where THE BIG ONE kills off most of humanity. But, at least with the data we found, the magnitudes of forest fires, solar flares, earthquakes and wars were all better fit by log-normals, by stretched exponentials and by cut-off power laws than by power laws. For fires, flares and quakes, the differences are large enough that they clearly fall into the "with cut-off only" category. The differences in fits for the war-death data are smaller, as (mercifully) is the sample size, so we put it in the "moderate" support category. If you had some compelling other reason to insist on a power law rather than (e.g.) a log-normal there, the data wouldn't slap you down, but they wouldn't back you up either.
Now, I relish the schadenfreude-laden flavors of a mega-disaster scenario as much as the next misanthropic, science-fiction-loving geek, especially when it's paired with some "The fools! Can't they follow simple math?" on the side. Truly, I do. But squeezing that savory, juicy DOOM out of (for instance) the distribution of solar flares relies on the shape of the tail, i.e., whether it's a pure power law or not. The weak support, in the data, for such powers law means you don't really have empirical evidence for your scenarios, and in some cases what evidence there is tells against them. It's a free country, so you can go on telling those stories, but don't pretend that they owe more to confronting hard truths than to literary traditions.
Posted by crshalizi at February 15, 2012 14:00 | permanent link
Attention conservation notice: 1500 word pedagogical-statistical rant, with sarcasm, mathematical symbols, computer code, and a morally dubious affectation of detachment from the human suffering behind the numbers. Plus the pictures are boring.
Does anyone know when the correlation coefficient is useful, as opposed to when it is used? If so, why not tell us?
— Tukey (1954: 721)
If you have taken any sort of statistics class at all, you have probably been exposed to the idea of the "proportion of variance explained" by a regression, conventionally written R2. This has two definitions, which happen to coincide for linear models fit by least squares. The first is to take the correlation between the model's predictions and the actual values (R) and square it (R2), getting a number which is guaranteed to be between 0 and 1. You get 1 only when the predictions are perfectly correlated with reality, and 0 when there is no linear relationship between them. The other definition is the ratio of the variance of the predictions to the variance of the actual values. It is this latter which leads to the notion that R2 is the proportion of variance explained by the model.
The use of the word "explained" here is quite unsupported and often actively misleading. Let me go over some examples to indicate why.
Start by supposing that a linear model is true:
Well, no. The answer depends on the variance of X, which it will be convenient to call v. The variance of the predictions is b2 v, but the variance of Y is larger, b2 v + s. The ratio is
Now, you say, this is a silly algebraic curiosity. Never mind the Good Fairy of Statistical Modeling handing us the correct parameters, let's talk about something gritty and real, like death in Chicago.
![]() |
| Number of deaths each day in Chicago, 1 January 1987--31 December 2000, from all causes except accidents. (Click this and all later figures for larger PDF versions. See below for link to code.) |
I can relate deaths to time in any number of ways; the next figure shows what I get when I use a smoothing spline (and use cross-validation to pick how much smoothing to do). The statistical model is
|
| As before, but with the addition of a smoothing spline. |
The root-mean-square error of the smoothing spline is just above 12 deaths/day. The R2 of the fit is either 0.35 (squared correlation between predicted and actual deaths) or 0.33 (variance of predicted deaths over variance of actual deaths). It seems absurd, however, to say that the date explains how many people died in Chicago on a given day, or even the variation from day to day. The closest I can come up with to an example of someone making such a claim would be an astrologer, and even one of them would work in some patter about the planets and their influences. (Numerologists, maybe? I dunno.)
Worse is to follow. The same data set which gives me these values for Chicago includes other variables, such as the concentration of various atmospheric pollutants and temperature. I can fit an additive model, which tries to tease out the separate relationships between each of those variables and deaths in Chicago, without presuming a particular functional form for each relationship. In particular I can try the model
The R2 of this model is 0.27. Is this "variance explained"? Well, it's at least not incomprehensible to talk about changes in temperature or pollution explaining changes in mortality. In fact, adding this model's predictions to the simple spline's, we see that most of what the spline predicted from the date is predictable from pollution and temperature:
|
| Black dots: actual death counts. Red curve: spline smoothing on the date alone. Blue lines: predictions from the temperature-and-pollution model. |
We could, in fact, try to include the date in this larger model:
|
|
|
|
Despite the lack of visual drama, putting a smooth function of time back into the model increases R2, from 0.27 to 0.30. Formally, the date enters into the model in exactly the same way as particulate pollution. But, again, only a fortune teller — an unusually numerate fortunate teller, perhaps a subscriber to the Journal of Evidence-Based Haruspicy — would say that the date explains, or helps explain, 3% of the variance.
I hope that by this point you will at least hesitate to think or talk about R2 as "the proportion of variance explained". (I will not insist on your never talking that way, because you might need to speak to the deluded in terms they understand.) How then should you think about it? I would suggest: the proportion of variance retained, or just kept, by the predictions. Linear regression is a smoothing method. (It just smoothes everything on to a line, or more generally a hyperplane.) It's hard for any smoother to give fitted values which have more variance than the variable it is smoothing. R2is merely the fraction of the target's variance which is not smoothed away.
This of course raises the question of why you'd care about this number at all. If prediction is your goal, then it would seem much more natural to look at mean squared error. (Or really root mean squared error, so it's in the same units as the variable predicted.) Or mean absolute error. Or median absolute error. Or a genuine loss function. If on the other hand you want to get some function right, then your question is really about mis-specification, and/or confidence sets of functions, and not about whether your smoother is following every last wiggle of the data at all. If you want an explanation, the fact that there is a peak in deaths every year of about the same height, but the predictions fall short of it, suggests that this model is missing something. The fact that the data shows something awful happened in 1995 and the model has nothing adequate to say about it suggests that whatever's missing is very important.
Code for reproducing the figures and analyses in R. (I make this public, despite the similarity of this exercise to the last problem-set in advanced data analysis, because (i) it's not exactly the same, (ii) the homework is due in ten hours, (iii) none of my students would dream of copying this and turning it in as their own, and (iv) I borrowed the example from Simon Wood's Generalized Additive Models.)
Posted by crshalizi at February 13, 2012 23:54 | permanent link
1. I'd like to say that you have no idea how long I have waited to read something like this piece by Michael Stumpf and Mason Porter in one of the glossy journals. But that would be a lie, because if you've been reading this for any length of time, you know that the answer is, long enough to be very tiresome about it. If the referees, and still more the editors, at those journals can be persuaded to pay attention, we will be on track for my mid-2007 hope that "in five to ten years even science journalists and editors of Wired will begin to get the message." (I never really had any hopes for Wired.)
2. You can imagine how my heart sank to see that Krugman had a post titled "The Power (Law) of Twitter" — and my relief to see that he's not actually saying that the distribution of followers is a power law. It is however interesting that the distribution is so close to a log-normal.
3. My ex-boss and mentor Melanie Mitchell has a blog, and promises a substantive series of posts on power laws and scaling. In the meanwhile, go read her book.
Update, 15 February: see later post.
Manual trackback: Brendan O'Connor
(Nos. 1 and 2 via too many to list.)
Posted by crshalizi at February 13, 2012 20:40 | permanent link
The "curse of dimensionality" limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimensionality. Parametric models do not have this problem, but have bias and do not let us discover anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example, where each input variable has a "partial response function", which add together to get the total regression function; the partial response functions are unconstrained. This generalizes linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. Examples in R using the California house-price data. Conclusion: there are no statistical reasons to prefer linear models to additive models, hardly any scientific reasons, and increasingly few computational ones; the continued thoughtless use of linear regression is a scandal.
Reading: Notes, chapter 8; Faraway, chapter 12
Posted by crshalizi at February 09, 2012 10:30 | permanent link
In which spline regression becomes a matter of life and death in Chicago.
Posted by crshalizi at February 07, 2012 10:31 | permanent link
Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression.
Reading: Notes, chapter 7; Faraway, section 11.2.
Posted by crshalizi at February 07, 2012 10:30 | permanent link
Weighted least squares estimates. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
Reading: Notes, chapter 6; Faraway, section 11.3.
Posted by crshalizi at February 02, 2012 10:30 | permanent link