Attention conservation notice: 1400 words on a friend's proposal to do away with peer review, written many weeks ago when there was actually some debate about this.
Larry is writing about peer review (again), this time to advocate "A World Without Referees". Every scientist, of course, has day-dreamed about this, in a first-let's-kill-all-the-lawyers way, but Larry is serious, so let's treat this seriously. I'm not going to summarize his argument; it's short and you can and should go read it yourself.
I think it helps, when thinking about this, to separate two functions peer-reviewed journals and conferences have traditionally served. One is spreading claims (dissemination), and the other is letting readers know about claims worthy of their attention (certification).
Arxiv, or something like it, can take over dissemination handily. Making copies of papers is now very cheap and very fast, so we no longer have to be choosy about which ones we disseminate. In physics, this use of Arxiv is just as well-established as Larry says. In fact, one reason Arxiv was able to establish itself so rapidly and thoroughly among physicists was that they already had a well-entrenched culture of circulating preprints long before journal publication. What Arxiv did was make this public and universally accessible.
But physicists still rely on journals for certification. People pay more attention to papers which come out in Physical Review Letters, or even just Physical Review E, than ones which are only on Arxiv. "Could it make it past peer review?" is used by many people as a filter to weed out the stuff which is too wrong or too unimportant to bother with. This doesn't work so well for those directly concerned with a particular research topic, but if something is only peripherally of interest, it makes a lot of sense.
Even within a specialized research community, consisting entirely of experts who can evaluate new contributions on their own, there is a rankling inefficiency to the world without referees. Larry talks about spending a minute or two looking at new stats. papers on Arxiv every day. But everyone filtering Arxiv for themselves is going to get harder and harder as more potentially-relevant stuff gets put on it. I'm interested in information theory, so I've long looked at cs.IT, and it's become notably more time-consuming as that community has embraced the Arxiv. Yet within any given epistemic community, lots of people are going to be applying very similar filters. So the world-without-referees has an increasing amount of work being done by individuals, but a lot of that work is redundant. Efficiency, the division of labor, points to having a few people put their time into filtering, and the rest of us relying on it, even when in principle we could do the filtering ourselves. To be fair, of course, we should probably take this job in turns...
So: if all papers get put on Arxiv, filtering becomes a big job, so efficiency pushes us towards having only some members of the research community do the filtering for the rest. We have re-invented something very much like peer review, purely so that our lives are not completely consumed by evaluating new papers, and we can actually get some work done.
Larry's proposal for a world without referees also doesn't seem to take into account the needs of researchers to rely on findings in fields in which they are not experts, and so can't act as their own filters. (Or they could if they put in a few years in something else first.) If I need some result from neuroscience, or for that matter from topology, I do not have the time to spend becoming a neuroscientist or topologist, and it is an immense benefit to have institutions I can trust to tell me "these claims about cortical columns, or locally compact Hausdorff spaces, are at least not crazy". This is also a kind of filtering, and there is the same push, based on the division of labor, to rely on only some neuroscientists or topologists to do the filtering for outsiders (or all of them only some of the time), and again we have re-created something very much like refereeing.
So: some form or forms of filtering is inevitable, and the forces pushing for a division of labor in filtering are very strong. I don't know of any reason to think that the current, historically-evolved peer review system is the best way of organizing this cognitive triage, but we're not going to avoid having some such system, nor should we want to. Different ways of organizing the work of filtering will have different costs and benefits, but we should be talking about those trade-offs, not hoping that we can just wish the problem away now that making copies is cheap [1]. It's not at all obvious, for instance, that attention-filtering for the internal benefit of members of a research community should be done in the same way as reliability-filtering for outsiders. But, to repeat, we are going to have filters and they are almost certainly going to involve a division of labor.
Lenin, supposedly, said that "small production engenders capitalism and the bourgeoisie daily, hourly, spontaneously and on a mass scale" (Nove, The Economics of Feasible Socialism Revisited, p. 46). Whether he was right about the bourgeoisie or not, the rate of production of the scientific literature, the similarity of interests and standards within a community, and the need to rely on other fields' findings are all going to engender refereeing-like institutions, "daily, hourly, spontaneously and on a mass scale". I don't think Larry would go to the same lengths to get rid of referees that Lenin went to get rid of the bourgeoisie, but in any case the truly progressive course is not to suppress the old system by force, but to provide a superior alternative.
Speaking personally, I am attracted to a scenario we might call "peer review among consenting adults". Let anyone put anything on Arxiv (modulo the usual crank-screen). But then let others create filtered versions, applying such standards of topic, rigor, applicability, writing quality, etc., as they please --- and be explicit about what those standards are. These can be layered as deep as their audience can support. Presumably the later filters would be intended for those further from active research in the area, and so would be less tolerant of false alarms, and more tolerant of missing possible discoveries, than the filters for those close to the work. But this could be an area for experiment, and for seeing what people actually find useful. This is, I take it, more or less what Paul Ginsparg proposes, and it has a lot to recommend it. Every contribution is available if anyone wants to read it, but no one is compelled to try to filter the whole flow of the scholarly literature unaided, and human intelligence can still be used to amplify interesting signals, or even to improve papers.
Attractive as I find this idea, I am not saying it is historically inevitable, or even the best possible way of ordering these matters. The main point is that peer review does some very important jobs for the community of inquirers (whether or not it evolved to do them), and that if we want to get rid of it, it would be a good idea to have something else ready to do those jobs.
[1]: For instance, many people have suggested that referees should have to take responsibility, in some way, for their reports, so that those who do sloppy or ignorant or merely-partisan work will be at least shamed. There is genuinely a lot to be said for this. But it does run into the conflicting demand that science should not be a respecter of persons --- if Grand Poo-Bah X writes a crappy paper, people should be able to call X on it, without fear of retribution or considering the (inevitable) internal politics of the discipline and the job-market. I do not know if there is a way to reconcile these, but that's one of the kinds of trade-offs we have to consider as we try to re-design this institution.
Learned Folly; Kith and Kin; The Collective Use and Evolution of Concepts
Posted by crshalizi at May 21, 2012 02:00 | permanent link
Sometimes, all you can do is quote verbatim* from your inbox:
Date: Tue, 17 Apr 2012 09:31:57 -0400
From: Stephen Wolfram
To: Cosma Shalizi
Subject: 10-year followup on "A New Kind of Science"
Next month it'll be 10 years since I published "A New Kind of Science" ... and I'm planning to take stock of the decade of commentary, feedback and follow-on work about the book that's appeared.
My archives show that you wrote an early review of the book:
http://www.cscs.umich.edu/~crshalizi/reviews/wolfram/
At the time reviews like yours appeared, most of the modern web apparatus for response and public discussion had not yet developed. But now it has, and there seems to be considerable interest in the community in me using that venue to give my responses and comments to early reviews. I'm writing to ask if there's more you'd like to add before I embark on my analysis in the next week or so.
I'd like to take this opportunity to thank you for the work you put into writing a review of my book. I know it was a challenge to review a book of its size, especially quickly. I plan to read all reviews with forbearance, and hope that---especially leavened by the passage of a decade---useful intellectual points can be derived from discussing them.
If you don't have anything to add to your early review, it'd be very helpful to know that as soon as possible.
Thanks in advance for your help.
-- Stephen Wolfram
P.S. Nowadays you can find the whole book online at
http://www.wolframscience.com/nksonline/toc.html
If you'd like a new physical copy, just let me know and I can have it shipped...
I wrote my review in 2002 (though I didn't put it out until 2005). The idea that complex patterns can arise from simple rules was already old then, and has only become more commonplace since. A lot of interesting, substantive, specific science has been done on that theme in the ensuing decade. To this effort, neither Wolfram nor his book has contributed anything of any note. The one respect in which I was overly pessimistic is that I have not, in fact, had to spend much time "de-programming students [who] read A New Kind of Science before knowing any better", but I get a rather different class of students these days than I did in 2002.
Otherwise, and for the record, I do indeed still stand behind the review.
Manual trackback: Hacker News; Wolfgang; Andrew Gelman
*: I removed our e-mail addresses, because no one deserves spam.
Posted by crshalizi at May 03, 2012 23:10 | permanent link
Attention conservation notice: Boring details about getting finicky statistical software to work; or, please read the friendly manual.
Some of my students are finding it difficult to install the R package pcalg; I share these instructions in case others are also in difficulty.
source("http://bioconductor.org/biocLite.R")
biocLite("RBGL")
(Since RBGL depends on graph, this should automatically also install graph; if not, run biocLite("graph"), then biocLite("RBGL").)
You can still extract the graph by hand from the fitted models returned by functions like pc --- if one of those objects is fit, then fit@graph@edgeL is a list of lists, where each node has its own list, naming the other nodes it has arrows to (not from). If you are doing this for the final in ADA, you don't actually need anything beyond this to do the assignment, as explained in question A1a.
source("http://bioconductor.org/biocLite.R")
biocLite("Rgraphviz")
The README for Rgraphviz gives some checks which you should be able to run if everything is working; try them.
When I installed pcalg on my laptop two weeks ago, it was painless, because (1) I already had graphviz, and (2) I knew about BioConductor. (In fact, the R graphical interface on the Mac will switch between installing packages from CRAN and from BioConductor.) To check these instructions, I just now deleted all the packages from my computer and re-installed them, and everything worked; elapsed time, ten minutes, mostly downloading.
Posted by crshalizi at May 02, 2012 21:30 | permanent link
In which we are devoted to two problems of political economy, viz., strikes, and macroeconomic forecasting.
Posted by crshalizi at May 01, 2012 10:31 | permanent link
What time series are. Properties: autocorrelation or serial correlation; other notions of serial dependence; strong and weak stationarity. The correlation time and the world's simplest ergodic theorem; effective sample size. The meaning of ergodicity: a single increasingly long time series becomes representative of the whole process. Conditional probability estimates; Markov models; the meaning of the Markov property. Autoregressive models, especially additive autoregressions; conditional variance estimates. Bootstrapping time series. Trends and de-trending.
Posted by crshalizi at May 01, 2012 10:30 | permanent link
Attention conservation notice: I have no taste.
Now, why do the various animals do what seem to us such strange things, in the presence of such outlandish stimuli? Why does the hen, for example, submit herself to the tedium of incubating such a fearfully uninteresting set of objects as a nestful of eggs, unless she have some sort of a prophetic inkling of the result? The only answer is ad hominem. We can only interpret the instincts of brutes by what we know of instincts in ourselves. Why do men always lie down, when they can, on soft beds rather than on hard floors? Why do they sit round the stove on a cold day? Why, in a room, do they place themselves, ninety-nine times out of a hundred, with their faces towards its middle rather than to the wall? Why do they prefer saddle of mutton and champagne to hard-tack and ditch-water? Why does the maiden interest the youth so that everything about her seems more important and significant than anything else in the world? Nothing more can be said than that these are human ways, and that every creature likes its own ways, and takes to the following them as a matter of course. Science may come and consider these ways, and find that most of them are useful. But it is not for the sake of their utility that they are followed, but because at the moment of following them we feel that that is the only appropriate and natural thing to do. Not one man in a billion, when taking his dinner, ever thinks of utility. He eats because the food tastes good and makes him want more. If you ask him why he should want to eat more of what tastes like that, instead of revering you as a philosopher he will probably laugh at you for a fool. The connection between the savory sensation and the act it awakens is for him absolute and selbstverständlich, an "a priori synthesis" of the most perfect sort, needing no proof but its own evidence.
It takes, in short, what Berkeley calls a mind debauched by learning to carry the process of making the natural seem strange, so far as to ask for the why of any instinctive human act. To the metaphysician alone can such questions occur as: Why do we smile, when pleased, and not scowl? Why are we unable to talk to a crowd as we talk to a single friend? Why does a particular maiden turn our wits so upside-down? The common man can only say, "Of course we smile, of course our heart palpitates at the sight of the crowd, of course we love the maiden, that beautiful soul clad in that perfect form, so palpably and flagrantly made from all eternity to be loved!"
More soberly, or at least with fewer hens and maggots, this is highly reminiscent of Robert Frank's Passion within Reason, which I do not believe Williams mentions.
And so, probably, does each animal feel about the particular things it tends to do in presence of particular objects. They, too, are a priori syntheses. To the lion it is the lioness which is made to be loved; to the bear, the she-bear. To the broody hen the notion would probably seem monstrous that there should be a creature in the world to whom a nestful of eggs was not the utterly fascinating and precious and never-to-be-too-much-sat-upon object which it is to her.
Thus we may be sure that, however mysterious some animals' instincts may appear to us, our instincts will appear no less mysterious to them. And we may conclude that, to the animal which obeys it, every impulse and every step of every instinct shines with its own sufficient light, and seems at the moment the only eternally right and proper thing to do. It is done for its own sake exclusively. What voluptuous thrill may not shake a fly, when she at last discovers the one particular leaf, or carrion, or bit of dung, that out of all the world can stimulate her ovipositor to its discharge? Does not the discharge then seem to her the only fitting thing? And need she care or know anything about the future maggot and its food?
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Enigmas of Chance; Central Asia; Philosophy
Posted by crshalizi at April 30, 2012 23:59 | permanent link
In which the arts of estimating causal effects from observational data are practiced on Sesame Street.
Posted by crshalizi at April 24, 2012 10:31 | permanent link
Estimating graphical models: substituting consistent estimators into the formulas for front and back door identification; average effects and regression; tricks to avoid estimating marginal distributions; matching and propensity scores as computational short-cuts in back-door adjustment. Instrumental variables estimation: the Wald estimator, two-stage least-squares. Summary recommendations for estimating causal effects.
Reading: Notes, chapter 24
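For readers who want to see the Wald estimator from this lecture in action, here is a minimal sketch in Python. It is not taken from the course notes: the simulated data, the coefficients, and the function names `wald_estimate` and `ols_slope` are all assumptions made up for illustration. With a binary instrument, the Wald estimator is the difference in mean outcome across instrument values divided by the difference in mean treatment.

```python
import numpy as np

# Simulated data (all numbers are illustrative assumptions).
rng = np.random.default_rng(1)
n = 50_000
z = rng.binomial(1, 0.5, size=n)               # binary instrument
u = rng.normal(size=n)                         # unobserved confounder
x = 1.0 * z + u + rng.normal(size=n)           # treatment: moved by z and by u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)     # outcome: true effect of x is 2.0

def wald_estimate(y, x, z):
    """Wald estimator for a binary instrument z (coded 0/1)."""
    dy = y[z == 1].mean() - y[z == 0].mean()   # effect of z on y
    dx = x[z == 1].mean() - x[z == 0].mean()   # effect of z on x
    return dy / dx

def ols_slope(y, x):
    """Ordinary regression slope of y on x, ignoring the confounder."""
    return np.cov(y, x)[0, 1] / np.var(x, ddof=1)

iv = wald_estimate(y, x, z)    # should land near the true effect, 2.0
naive = ols_slope(y, x)        # pulled away from 2.0 by the confounder u
print(iv, naive)
```

The instrument only helps here because, by construction, z affects y solely through x and is independent of u; the sketch assumes those conditions rather than checking them.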
Posted by crshalizi at April 24, 2012 10:30 | permanent link
In which we use graphical causal models to understand twin studies and variance components.
Posted by crshalizi at April 22, 2012 12:00 | permanent link
Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Summary recommendations for identifying causal effects.
Reading: Notes, chapter 23
Posted by crshalizi at April 21, 2012 12:00 | permanent link
Attention conservation notice: 2500+ words on estimating how quickly time series forget their own history. Only of interest if you care about the intersection of stochastic processes and statistical learning theory. Full of jargon, equations, log-rolling and self-promotion, yet utterly abstract.
I promised to say something about the content of Daniel's thesis, so let me talk about two of his papers, which go into chapter 4; there is a short conference version and a long journal version.
Recall the world's simplest ergodic theorem: suppose \( X_t \) is a sequence of random variables with common expectation \( m \) and variance \( v \), and stationary covariance \( \mathrm{Cov}[X_t, X_{t+h}] = c_h \). Then the time average \( \overline{X}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{X_i} \) also has expectation \( m \), and the question is whether it converges on that expectation. The world's simplest ergodic theorem asserts that if the correlation time \[ T = \frac{\sum_{h=1}^{\infty}{|c_h|}}{v} < \infty \] then \[ \mathrm{Var}\left[ \overline{X}_n \right] \leq \frac{v}{n}(1+2T) \]
Since, as I said, the expectation of \( \overline{X}_n \) is \( m \) and its variance is going to zero, we say that \( \overline{X}_n \rightarrow m \) "in mean square".
From this, we can get a crude but often effective deviation inequality, using Chebyshev's inequality: \[ \Pr{\left(|\overline{X}_n - m| > \epsilon\right)} \leq \frac{v}{\epsilon^2}\frac{1+2T}{n} \]
The meaning of the condition that the correlation time \( T \) be finite is that the correlations themselves have to trail off as we consider events which are widely separated in time — they don't ever have to be zero, but they do need to get smaller and smaller as the separation \( h \) grows. (One can actually weaken the requirement on the covariance function to just \( \lim_{n\rightarrow \infty}{\frac{1}{n}\sum_{h=1}^{n}{c_h}} = 0 \), but this would take us too far afield.) In fact, as these formulas show, the convergence looks just like what we'd see for independent data, only with \( \frac{n}{1+2T} \) samples instead of \( n \), so we call the former the effective sample size.
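To make the correlation time and the effective sample size concrete, here is a small numerical sketch in Python. The covariance function \( c_h = v\rho^h \) (the form an AR(1) process would have), the particular values of \( v \), \( \rho \), \( n \) and \( \epsilon \), and the function names are all assumptions chosen for illustration. Nothing is simulated: the code compares the exact variance of the time average with the bound from the theorem.

```python
import numpy as np

def correlation_time(c, v):
    """T = sum_{h>=1} |c_h| / v, where c[0] is c_1, c[1] is c_2, and so on."""
    return np.sum(np.abs(c)) / v

def variance_of_mean(c, v, n):
    """Exact variance of the average of X_1..X_n under stationary covariance.

    Uses Var = (1/n) * (v + 2 * sum_{h=1}^{n-1} (1 - h/n) c_h); needs len(c) >= n-1.
    """
    h = np.arange(1, n)
    return (v + 2.0 * np.sum((1.0 - h / n) * c[: n - 1])) / n

# Assumed covariance function: c_h = v * rho**h.
v, rho, n = 1.0, 0.8, 200
c = v * rho ** np.arange(1, 2000)    # truncated where rho**h is negligible

T = correlation_time(c, v)           # essentially rho / (1 - rho) = 4
bound = v / n * (1 + 2 * T)          # right-hand side of the theorem
exact = variance_of_mean(c, v, n)    # never exceeds the bound
n_eff = n / (1 + 2 * T)              # effective sample size

eps = 0.5
chebyshev = v / eps**2 * (1 + 2 * T) / n   # bound on Pr(|mean - m| > eps)
print(T, exact, bound, n_eff, chebyshev)
```

With these numbers, 200 dependent observations are worth roughly 22 independent ones, which is the sense in which \( n/(1+2T) \) is an effective sample size.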
All of this is about the convergence of averages of \( X_t \), and based on its covariance function \( c_h \). What if we care not about \( X \) but about \( f(X) \)? The same idea would apply, but unless \( f \) is linear, we can't easily get its covariance function from \( c_h \). The mathematicians' solution to this has been to invent stronger notions of decay-of-correlations, called "mixing". Very roughly speaking, we say that \( X \) is mixing when, if you pick any two (nice) functions \( f \) and \( g \), I can always show that \[ \lim_{h\rightarrow\infty}{\mathrm{Cov}\left[ f(X_t), g(X_{t+h}) \right]} = 0 \]
Note (or believe) that this is "convergence in distribution"; it happens if, and only if, the distribution of events up to time \( t \) is becoming independent of the distribution of events from time \( t+h \) onwards.
To get useful results, it is necessary to quantify mixing, which is usually done through somewhat stronger notions of dependence. (Unfortunately, none of these have meaningful names. The review by Bradley ought to be the standard reference.) For instance, the "total variation" or \( L_1 \) distance between probability measures \( P \) and \( Q \), with densities \( p \) and \( q \), is \[ d_{TV}(P,Q) = \frac{1}{2}\int{|p(u) - q(u)| du} \] This has several interpretations, but the easiest to grasp is that it says how much \( P \) and \( Q \) can differ in the probability they give to any one event: for any \( E \), \( d_{TV}(P,Q) \geq |P(E) - Q(E)| \). One use of this distance is to measure the dependence between random variables, by seeing how far their joint distribution is from the product of their marginal distributions. Abusing notation a little to write \( P(U,V) \) for the joint distribution of \( U \) and \( V \), we measure dependence as \[ \beta(U,V) \equiv d_{TV}(P(U,V), P(U) \otimes P(V)) = \frac{1}{2}\int{|p(u,v)-p(u)p(v)|du dv} \] This will be zero just when \( U \) and \( V \) are statistically independent, and one when, on average, conditioning on \( U \) confines \( V \) to a set which would otherwise have probability zero. (For instance, if \( U \) has a continuous distribution and \( V \) is a function of \( U \), or one of two randomly chosen functions of \( U \).)
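For discrete variables the integrals become sums, and the definitions can be checked directly. The following Python sketch is only an illustration: the function names and the two example joint distributions are assumptions, not anything from the papers.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions on the same cells."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def beta_dependence(joint):
    """beta(U,V): distance from the joint table to the product of its marginals.

    joint[i, j] = Pr(U = i, V = j); rows index U, columns index V.
    """
    joint = np.asarray(joint, dtype=float)
    p_u = joint.sum(axis=1)    # marginal of U
    p_v = joint.sum(axis=0)    # marginal of V
    return total_variation(joint, np.outer(p_u, p_v))

# Independent case: the joint table is exactly the product of its marginals.
independent = np.outer([0.3, 0.7], [0.6, 0.4])
beta_indep = beta_dependence(independent)

# Fully dependent case: V = U, with U uniform on k values.
k = 50
copy = np.eye(k) / k
beta_copy = beta_dependence(copy)

print(beta_indep, beta_copy)
```

In the second case the value works out to \( 1 - 1/k \), which approaches one as \( k \) grows; this matches the remark that \( \beta = 1 \) needs something like a continuous \( U \).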
We can relate this back to the earlier idea of correlations between functions by realizing that \[ \beta(U,V) = \frac{1}{2}\sup_{|r|\leq 1}{\left|\int{r(u,v) dP(U,V)} - \int{r(u,v)dP(U)dP(V)}\right|} ~, \] so that \( \beta \) says how much the expected value of a bounded function \( r \) could change between the dependent and the independent distributions. (The factor of \( \frac{1}{2} \) matches the one in the definition of \( d_{TV} \): the supremum is attained by \( r = \pm 1 \) according to the sign of the difference in densities.) (There is no assumption that the test function \( r \) factorizes, and in fact it's important to allow \( r(u,v) \neq f(u)g(v) \).)
We apply these ideas to time series by looking at the dependence between the past and the future: \[ \begin{eqnarray*} \beta(h) & \equiv & d_{TV}(P(X^t_{-\infty}, X_{t+h}^{\infty}), P(X^t_{-\infty}) \otimes P(X_{t+h}^{\infty})) \\ & = & \frac{1}{2}\int{|p(x^t_{-\infty},x_{t+h}^{\infty})-p(x^t_{-\infty})p(x^{\infty}_{t+h})|dx^t_{-\infty}dx^{\infty}_{t+h}} \end{eqnarray*} \] (By stationarity, the integral actually does not depend on \( t \).) When \( \beta(h) \rightarrow 0 \) as \( h \rightarrow \infty \), we have a "beta-mixing" process. (These are also called "absolutely regular".) Convergence in total variation implies convergence in distribution, but not vice versa, so beta-mixing is stronger than common-or-garden mixing.
Notions like beta-mixing were originally introduced purely for probabilistic convenience, to handle questions like "when does the central limit theorem hold for stochastic processes?" These are interesting for people who like stochastic processes, or indeed for those who want to do Markov chain Monte Carlo and want to know how long to let the chain run. For our purposes, though, what's important is that when people in statistical learning theory have given serious attention to dependent data, they have usually relied on a beta-mixing assumption.
The reason for this focus on beta-mixing is that it "plays nicely" with approximating dependent processes by independent ones. The usual form of such arguments is as follows. We want to prove a result about our dependent but mixing process \( X \). For instance, we realize that our favorite prediction model will tend to do worse out-of-sample than on the data used to fit it, and we might want to bound the probability that this over-fitting will exceed \( \epsilon \). If we know the beta-mixing coefficients \( \beta(h) \), we can pick a separation, call it \( a \), where \( \beta(a) \) is reasonably small. Now we divide \( X \) up into \( \mu = n/a \) blocks of length \( a \). If we take every other block, they're nearly independent of each other (because \( \beta(a) \) is small) but not quite (because \( \beta(a) \neq 0 \)). Introduce a (fictitious) random sequence \( Y \), where blocks of length \( a \) have the same distribution as the blocks in \( X \), but there's no dependence between blocks. Since \( Y \) is an IID process, it is easy for us to prove that, for instance, the probability of over-fitting \( Y \) by more than \( \epsilon \) is at most some small \( \delta(\epsilon,\mu/2) \). Since \( \beta \) tells us about how well dependent probabilities are approximated by independent ones, the probability of the bad event happening with the dependent data is at most \( \delta(\epsilon,\mu/2) + (\mu/2)\beta(a) \). We can make this as small as we like by letting \( \mu \) and \( a \) both grow as the time series gets longer. Basically, any result which holds for an IID process will also hold for a beta-mixing one, with a penalty in the probability that depends on \( \beta \). There are some details to fill in here (how to pick the separation \( a \)? should the blocks always be the same length as the "filler" between blocks?), but this is the basic frame.
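The bookkeeping of the blocking argument is simple enough to write down. This Python sketch only shows the mechanics, under the simplest choice where the kept blocks and the discarded "filler" have the same length \( a \); the function names and the numbers plugged in at the end are assumptions for illustration.

```python
import numpy as np

def alternate_blocks(x, a):
    """Cut x into consecutive blocks of length a and keep every other block."""
    x = np.asarray(x)
    mu = len(x) // a                        # number of complete blocks
    blocks = x[: mu * a].reshape(mu, a)     # row j holds x[j*a : (j+1)*a]
    return blocks[::2]                      # blocks 0, 2, 4, ..., each a steps apart

def dependent_bound(delta_iid, mu, beta_a):
    """Bound for the dependent data: delta(eps, mu/2) + (mu/2) * beta(a)."""
    return delta_iid + (mu / 2) * beta_a

x = np.arange(100)                 # stand-in for a time series of length n = 100
kept = alternate_blocks(x, a=10)   # 5 blocks of length 10
print(kept.shape, kept[1, 0])

# Made-up numbers: an IID bound of 0.01 and beta(a) = 0.002 with mu = 10 blocks.
print(dependent_bound(0.01, mu=10, beta_a=0.002))
```

The trade-off mentioned in the text shows up in the second function: a larger \( a \) shrinks \( \beta(a) \) but leaves fewer blocks, which loosens the IID bound \( \delta(\epsilon,\mu/2) \).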
What it leaves open, however, is how to estimate the mixing coefficients \( \beta(h) \). For Markov models, one could in principle calculate them from the transition probabilities. For more general processes, though, calculating beta from the known distribution is not easy. In fact, we are not aware of any previous work on estimating the \( \beta(h) \) coefficients from observational data. (References welcome!) Because of this, even in learning theory, people have just assumed that the mixing coefficients were known, or that it was known they went to zero at a certain rate. This was not enough for what we wanted to do, which was actually calculate bounds on error from data.
There were two tricks to actually coming up with an estimator. The first was to reduce the ambitions a little bit. If you look at the equation for \( \beta(h) \) above, you'll see that it involves integrating over the infinite-dimensional distribution. This is daunting, so instead of looking at the whole past and future, we'll introduce a horizon, \( d \) steps away, and cut things off there: \[ \begin{eqnarray*} \beta^{(d)}(h) & \equiv & d_{TV}(P(X^t_{t-d}, X_{t+h}^{t+h+d}), P(X^t_{t-d}) \otimes P(X_{t+h}^{t+h+d})) \\ & = & \frac{1}{2}\int{|p(x^t_{t-d},x_{t+h}^{t+h+d})-p(x^t_{t-d})p(x^{t+h+d}_{t+h})|dx^t_{t-d}dx^{t+h+d}_{t+h}} \end{eqnarray*} \] If \( X \) is a Markov process, then there's no difference between \( \beta^{(d)}(h) \) and \( \beta(h) \). If \( X \) is a Markov process of order \( p \), then \( \beta^{(d)}(h) = \beta(h) \) once \( d \geq p \). If \( X \) is not Markov at any order, it is still the case that \( \beta^{(d)}(h) \rightarrow \beta(h) \) as \( d \) grows. So we have an approximation to \( \beta \) which only involves finite-dimensional integrals, which we might have some hope of doing.
The other trick is to get rid of those integrals. Another way of writing the beta-dependence between the random variables \( U \) and \( V \) is \[ \beta(U,V) = \sup_{\mathcal{A},\mathcal{B}}{\frac{1}{2}\sum_{a\in\mathcal{A}}{\sum_{b\in\mathcal{B}}{\left| \Pr{(a \cap b)} - \Pr{(a)}\Pr{(b)} \right|}}} \] where \( \mathcal{A} \) runs over finite partitions of values of \( U \), and \( \mathcal{B} \) likewise runs over finite partitions of values of \( V \). I won't try to show that this formula is equivalent to the earlier definition, but I will contend that if you think about how that integral gets cashed out as a sum, you can sort of see how it would be. If we want \( \beta^{(d)}(h) \), we can take \( U = X^{t}_{t-d} \) and \( V = X^{t+h+d}_{t+h} \), and we could find the dependence by taking the supremum over partitions of those two variables.
Now, suppose that the joint density \( p(x^t_{t-d},x_{t+h}^{t+h+d}) \) was piecewise constant, with those pieces being rectangles parallel to the coordinate axes. Then sub-dividing those rectangles would not change the sum, and the \( \sup \) would actually be attained for that particular partition. Most densities are not of course piecewise constant, but we can approximate them by such piecewise-constant functions, and make the approximation arbitrarily close (in total variation). More, we can estimate those piecewise-constant approximating densities from a time series. Those estimates are, simply, histograms, which are about the oldest form of density estimation. We show that histogram density estimates converge in total variation on the true densities, when the bin-width is allowed to shrink as we get more data.
Because the total variation distance is in fact a metric, we can use the triangle inequality to get an upper bound on the true beta coefficient, in terms of the beta coefficients of the estimated histograms, and the expected error of the histogram estimates. All of the error terms shrink to zero as the time series gets longer, so we end up with consistent estimates of \( \beta^{(d)}(h) \). That's enough if we have a Markov process, but in general we don't. So we can let \( d \) grow as \( n \) does, and that (after a surprisingly long measure-theoretic argument) turns out to do the job: our histogram estimates of \( \beta^{(d)}(h) \), with suitably-growing \( d \), converge on the true \( \beta(h) \).
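To give the flavor of the histogram approach, here is a deliberately simplified plug-in sketch in Python. It is not the estimator from the papers: it fixes \( d = 0 \) (so it only compares \( X_t \) with \( X_{t+h} \)), uses a fixed number of equal-width bins rather than a bin-width that shrinks with \( n \), and ignores the error terms in the bound. The function names, the AR(1) test process and all numerical settings are assumptions for illustration.

```python
import numpy as np

def beta_hat(x, h, bins=8):
    """Plug-in histogram estimate of beta^(0)(h) from a single time series."""
    x = np.asarray(x, dtype=float)
    past, future = x[:-h], x[h:]                   # pairs (X_t, X_{t+h})
    counts, _, _ = np.histogram2d(past, future, bins=bins)
    joint = counts / counts.sum()                  # histogram estimate of the joint
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))  # product of marginals
    return 0.5 * np.abs(joint - product).sum()     # total variation between them

def simulate_ar1(phi, n, rng):
    """X_t = phi * X_{t-1} + Gaussian noise (started off-stationarity, for simplicity)."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

rng = np.random.default_rng(0)
noise = rng.normal(size=20000)         # independent data: true beta is zero
ar = simulate_ar1(0.9, 20000, rng)     # strongly dependent at short separations

print(beta_hat(noise, 1), beta_hat(ar, 1), beta_hat(ar, 30))
```

Even for independent data the estimate is not exactly zero, because sampling noise in each histogram cell is counted in absolute value; this is one reason the actual argument has to track the expected error of the histogram estimates.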
To confirm that this works, the papers go through some simulation examples, where it's possible to cross-check our estimates. We can of course also do this for empirical time series. For instance, in his thesis Daniel took four standard macroeconomic time series for the US (GDP, consumption, investment, and hours worked, all de-trended in the usual way). This data goes back to 1948, and is measured four times a year, so there are 255 quarterly observations. Daniel estimated a \( \beta \) of 0.26 at one quarter's separation, \( \widehat{\beta}(2) = 0.15 \), \( \widehat{\beta}(3) = 0.02 \), and somewhere between 0 and 0.11 for \(\widehat{\beta}(4) \). (That last is a sign that we don't have enough data to go beyond \( h = 4 \).) Optimistically assuming no dependence beyond a year, one can calculate the effective number of independent data points, which is not 255 but 31. This has morals for macroeconomics which are worth dwelling on, but that will have to wait for another time. (Spoiler: \( \sqrt{\frac{1}{31}} \approx 0.18 \), and that's if you're lucky.)
It's inelegant to have to construct histograms when all we want is a single number, so it wouldn't surprise us if there were a slicker way of doing this. (For estimating mutual information, which is in many ways analogous, estimating the joint distribution as an intermediate step is neither necessary nor desirable.) But for now, we can do it, when we couldn't before.
Posted by crshalizi at April 20, 2012 14:57 | permanent link
Probabilistic prediction is about passively selecting a sub-ensemble, leaving all the mechanisms in place, and seeing what turns up after applying that filter. Causal prediction is about actively producing a new ensemble, and seeing what would happen if something were to change ("counterfactuals"). Graphical causal models are a way of reasoning about causal prediction; their algebraic counterparts are structural equation models (generally nonlinear and non-Gaussian). The causal Markov property. Faithfulness. Performing causal prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules for linear models.
Reading: Notes, chapter 22
Posted by crshalizi at April 15, 2012 20:03 | permanent link
In which the analysis of multivariate data is recursively applied.
Reading: Notes, assignment
Posted by crshalizi at April 15, 2012 20:02 | permanent link
Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth?
Reading: Notes, chapter 21
Posted by crshalizi at April 15, 2012 20:01 | permanent link
From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry: planes again. Probabilistic clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.
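As a rough illustration of how EM replaces the vicious circle with iteration, here is a minimal sketch for a two-component Gaussian mixture in one dimension. The function names, the initialization (means at the sample extremes, both standard deviations set to the overall one) and the fixed iteration count are assumptions chosen for brevity, not the course's code in mixture-examples.R.

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iters=100):
    """Fit a two-component Gaussian mixture to 1-D data by EM.

    Returns (weights, means, sds, log-likelihood trace).
    """
    mu = [min(data), max(data)]             # crude starting means
    sigma = [statistics.pstdev(data)] * 2   # start both components wide
    weight = [0.5, 0.5]
    trace = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp, loglik = [], 0.0
        for x in data:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            s = p[0] + p[1]
            loglik += math.log(s)
            resp.append((p[0] / s, p[1] / s))
        trace.append(loglik)
        # M-step: weighted maximum likelihood for each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)  # floor guards against a collapsing component
    return weight, mu, sigma, trace

random.seed(0)
data = ([random.gauss(0, 1) for _ in range(150)]
        + [random.gauss(5, 1) for _ in range(150)])
weight, mu, sigma, trace = em_two_gaussians(data)
print(weight, mu, sigma)
```

The `trace` list records the log-likelihood at the start of each iteration, so one can check numerically that it never goes down, which is the property the lecture proves via Jensen's inequality.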
Extended example: Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components.
Reading: Notes, chapter 20; mixture-examples.R
Posted by crshalizi at April 15, 2012 20:00 | permanent link
On Friday, my student Daniel McDonald, who I have been lucky enough to jointly advise with Mark Schervish, defeated the snake — that is, defended his thesis:
I hope to have a follow-up post very soon about the substance of Daniel's work, which is part of our INET grant, but in the meanwhile: congratulations, Dr. McDonald!
Posted by crshalizi at April 08, 2012 17:25 | permanent link
Posted by crshalizi at April 06, 2012 01:03 | permanent link
Attention conservation notice: 2000 words of advice to larval academics, based on mere guesswork and ill-assimilated psychology.
It being the season for job-interview talks, student exam presentations, etc., the problems novices have with giving them are much on my mind. And since I find myself composing the same e-mail of advice over and over, why not write it out once and for all?
Once you understand the purpose of academic talks, it becomes clear that the two fundamental obstacles to giving good talks are memory and fear.
The point of an academic talk is to try to persuade your audience to agree with you about your research. This means that you need to raise a structure of argument in their minds, in less than an hour, using just your voice, your slides, and your body-language. Your audience, for its part, has no tools available to it but its ears, eyes, and mind. (Their phones do not, in this respect, help.)
This is a crazy way of trying to convey the intricacies of a complex argument. Without external aids like writing and reading, the mind of the East African Plains Ape has little ability to grasp, and more importantly to remember, new information. (The great psychologist George Miller estimated the number of pieces of information we can hold in short-term memory as "the magical number seven, plus or minus two", but this may if anything be an over-estimate.) Keeping in mind all the details of an academic argument would certainly exceed that slight capacity*. When you over-load your audience, they get confused and cranky, and they will either tune you out or avenge themselves on the obvious source of their discomfort, namely you.
Therefore, do not overload your audience, and do not even try to convey all the intricacies of a complex academic argument in your talk. The proper goal of an academic talk is to convey a reasonably persuasive sketch of your argument, so that your audience are better informed about the subject, get why they should care, and are usefully oriented to what you wrote if and when they decide to read your paper. In many ways a talk is really an extended oral abstract for your paper. (This is more effective if those who are interested can read your paper, at an open pre-print archive or at least on your website.) Success in this means keeping your audience's load low, and there are two big ways to do that: make it easier for them to remember what matters, and reduce what they have to remember.
People can remember things more easily if they have a scheme they can relate them to, which helps them appreciate their relevance. Your audience will come to the talk with various schemata; use them.
As for limiting the information the audience needs to remember, the main rule is to ask yourself "Do they need to know this to follow the argument?" and "Will they need to remember this later?" If they do not need to know it even for a moment, cut it. (Showing or telling them details, followed by "don't worry about the details", does not work.) If they will need to remember it later, emphasize it, and remind them when you need it.
To answer "Do they need to know this?" and "Will they have to recall this?", you need to be intimately familiar with the logic of your own talk. The ideal of such familiarity is to have that logic committed to memory — the logic, not some exact set of words. When you really understand it, when you grasp all the logical connections and see why everything that's necessary is needed, the argument can "carry you along" through the presentation, letting you compose appropriate words as you go, without rote memorization. This has many advantages, not least the ability to field questions.
As a corollary to limiting what the audience needs to remember, if you are using slides, their text should be (1) prompts for your exposition and your audience's memory, or (2) things which are just too hard to say, like equations**. (Do not, whatever you do, read aloud the text of your slides.) But whether spoken or on the slide, cut your talk down to the essentials. This requires you to know what is essential.
"But the lovely, no the divine, details!" you protest. "All those fine points I checked, all the intricate work I did, all the alternatives I ruled out? When do I get to talk about them?" To which there are several responses.
To sum up on memory, then: successful academic talks persuade your audience of your argument. To do this, and not instead alienate your audience, you have to work with their capacities and prior knowledge, and not against them. Negatively, this means limiting the amount of information you expect them to retain. Positively, you need to use, and make, schemata which help them see the relevance of particulars. You can still give an awful talk this way (maybe your argument is incredibly bad), but you can hardly give a good talk without it.
The major consideration in crafting the content of your talk is your audience's memory. The major consideration for the delivery of the talk is your fear. (Your own memory is not so great, but you have of course internalized the schema for your own talk, and so you can re-generate it as you go, using your slides as prompts.) Public speaking, especially about something important to you, and to an audience whose opinion matters to you, is intimidating to many people. Fear makes you a worse public speaker; you mumble, you forget your place in the argument, you can't think on your feet, you project insecurity (possibly by over-compensating), etc. You do not need to become a great, fearless public speaker; you do need to be adequate at it. The three major routes to doing this, in my experience, are desensitization, dissociation, and deliberate acts.
Desensitization is simple: the more you do it, and emerge unscathed, the less fearful you will be. Practice giving your talks to safe but critical audiences. ("But critical" is key: you need them to tell you honestly what wasn't working well. [Something can always be done better.]) If you can't get a safe-but-critical audience, get an audience you don't care about (e.g., some random conference), and practice on them. Remind yourself, too, that while your talk may be a big deal for you, it's rarely a big deal for your audience.
Dissociation is about embracing being a performer on a stage: the audience's idea of you is already a fictional character, so play a character. It can, once again, be very liberating to separate the persona you're adopting for the talk from the person you actually are. If that seems unethical, go read The Presentation of Self in Everyday Life. An old-fashioned insistence that what really matters are the ideas, and not their merely human vessel, can also be helpful here.
Finally, deliberate actions are partly about communicating better, and partly about a fake-it-till-you-make-it assumption of confidence. (Some of these are culture-bound, so adjust as need be.) Project your voice to be heard through the room. (Don't be ashamed to use a microphone if need be.) Look at your audience (not your shoes or the screen), letting your eyes rove over them to gauge their facial expressions; don't be afraid to maintain eye contact, but keep moving on. Maintain a nearly-conversational speed of talking; avoid long pauses. When fielding questions, don't defer to senior people or impose on your juniors; re-phrase the question before answering, to make sure everyone gets it, and to give yourself time to think about your reply. And for the sake of all that's holy, speak to the audience, not to a screen.
At the outset, I said that the two great obstacles to giving a good talk are memory and fear. The converse is that if you truly understand your own argument, and you truly believe in it, you can convey it in a way which works with your audience's memory, and overcome your own fear. The sheer mechanics of presentation will come with practice, and you will have something worth presenting.
Further reading:
*: Some branches of the humanities and the social sciences have the horrible custom of reading an academic paper out loud, apparently on the theory that this way none of the details get glossed over. The only useful advice which can be given about this is "Don't!". Academic prose has many virtues, but it is simply not designed for oral communication. Moreover, all of your audience consists of people who are very good at reading such prose, and can certainly do so at least as fast as you can recite it. Having people recite their papers, or even prepared remarks written in the style of a paper, does nothing except waste an hour in the life of the speaker and the audience, and none of us has hours to waste.
**: As a further corollary, and particularly important in statistics, big tables of numbers (e.g., regression coefficients) are pointless; and here "big" means "larger than 2x2".
Manual trackback: Rules of Reason; New APPS; Hacker News; paperpools; Nanopolitan; The Essence of Mathematics Is Its Freedom
Posted by crshalizi at April 04, 2012 01:09 | permanent link
Homework 8: in which returning to paleontology gives us an excuse to work with simulations, and to compare distributions.
Posted by crshalizi at April 03, 2012 23:40 | permanent link
Homework 8: in which we try to predict political orientation from bumps on the skull, or rather the volume of brain regions determined by MRI and adjusted by (unknown) formulas.
Posted by crshalizi at April 03, 2012 09:20 | permanent link
Adding noise to PCA to get a statistical model. The factor model, or linear regression with unobserved independent variables. Assumptions of the factor model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one factor models, more general correlation patterns for multiple factors. Our first look at latent variables and conditional independence. Geometrically, the factor model says the data cluster on some low-dimensional plane, plus noise moving them off the plane. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models.
Reading: Notes, chapter 19; factors.R and sleep.txt
Posted by crshalizi at April 03, 2012 09:15 | permanent link
Principal components is the simplest, oldest and most robust of dimensionality-reduction techniques. It works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the projection of the data on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.
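To illustrate the reduction to an eigenproblem, here is a small sketch that finds the first principal component by power iteration on the sample covariance matrix. Power iteration is a stand-in chosen to keep the example dependency-free; it is not what pca.R does, and the function name and defaults are hypothetical.

```python
import math

def first_principal_component(rows, iters=200):
    """Leading eigenvector of the sample covariance matrix, by power iteration.

    rows: list of equal-length numeric tuples (one tuple per data point).
    Returns (unit direction vector, variance of the data along it).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]

    # Start from an arbitrary direction; this fails only if the start happens
    # to be exactly orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]

    variance = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    return v, variance

# Points lying exactly on the line y = 2x: the first component should point along (1, 2).
points = [(t, 2 * t) for t in range(-5, 6)]
direction, variance = first_principal_component(points)
print(direction, variance)
```

The returned variance is the variance of the projections on to that direction, which is the quantity the first component maximizes.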
Reading: Notes, chapter 18; pca.R, pca-examples.Rdata, and cars-fixed04.dat
Posted by crshalizi at April 03, 2012 09:10 | permanent link
Applying the right CDF to a continuous random variable makes it uniformly distributed. How do we test whether some variable is uniform? The smooth test idea, based on series expansions for the log density. Asymptotic theory of the smooth test. Choosing the basis functions for the test and its order. Smooth tests for non-uniform distributions through the transformation. Dealing with estimated parameters. Some examples. Non-parametric density estimation on [0,1]. Checking conditional distributions and calibration with smooth tests. The relative distribution idea: comparing whole distributions by seeing where one set of samples falls in another distribution. Relative density and its estimation. Illustrations of relative densities. Decomposing shifts in relative distributions.
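The first claim is easy to check by simulation. The sketch below draws exponential variables, pushes them through their own CDF, and looks at how evenly the results fall into ten equal bins. The rate, sample size and bin count are arbitrary choices for illustration, and the binned check is only an eyeball test, not the smooth test itself.

```python
import math
import random

def exponential_cdf(x, rate):
    """CDF of the exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x)

random.seed(42)
rate = 2.0
sample = [random.expovariate(rate) for _ in range(10_000)]

# Probability integral transform: apply the true CDF to each draw.
u = [exponential_cdf(x, rate) for x in sample]

# Crude uniformity check: share of transformed values in each tenth of [0, 1].
counts = [0] * 10
for value in u:
    counts[min(int(value * 10), 9)] += 1
print([c / len(u) for c in counts])
```

Applying the wrong CDF (say, with a mis-estimated rate) would pile the transformed values up at one end, which is the sort of departure a test of uniformity is meant to pick up.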
Reading: Notes, chapter 17
Optional reading: Bera and Ghosh, "Neyman's Smooth Test and Its Applications in Econometrics"; Handcock and Morris, "Relative Distribution Methods"
Posted by crshalizi at April 03, 2012 09:00 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Enigmas of Chance; The Progressive Forces Central Asia; The Collective Use and Evolution of Concepts; The Commonwealth of Letters; Writing for Antiquity; Commit a Social Science; Pleasures of Detection, Portraits of Crime
Posted by crshalizi at March 31, 2012 23:59 | permanent link
From the all-too-small Department of Unambiguously Good Things Happening to People Who Thoroughly Deserve Them, Judea Pearl has won the Turing Award for 2011. As a long-time admirer*, I could not be more pleased, and would like to take this opportunity to recommend his "Causal Inference in Statistics" again.
*: I realize it edges into "I liked Feynman before he joined the Manhattan Project; the Williamsburg Project was edgier" territory, but I have very vivid memories of reading Probabilistic Reasoning in Intelligent Systems in the winter months of early 1999, and being correspondingly excited to hear that the first edition of Causality was coming out...
Posted by crshalizi at March 30, 2012 15:30 | permanent link
Attention conservation notice: Only of interest if you (1) care about high-dimensional statistics and (2) will be in Pittsburgh over the next two weeks.
I am not sure how our distinguished speakers would feel at being called sorcerers, but since one of them is using sparsity to read minds, and the other to infer causation from correlation, it is hard to think of a more appropriate word.
As always, the talks are free and open to the public; hecklers will, however, be turned into newts.
Posted by crshalizi at March 29, 2012 13:10 | permanent link
Attention conservation notice: Only of interest if you (1) care about statistical models of networks or collective information-processing, and (2) will be in Pittsburgh this week.
I am behind in posting my talk announcements:
Enigmas of Chance; Networks; The Collective Use and Evolution of Concepts
Posted by crshalizi at March 26, 2012 10:00 | permanent link
You are a theoretical physicist, trying to do data analysis, and "Such a Shande far de Goyim!" is all I can think after reading your manuscript. Even if it turns out we are playing out this touching scene (which never fails to bring tears to my eyes) — no.
(SMBC via Lost in Transcription)
Update: Thanks to reader R.K. for correcting my Yiddish.
Posted by crshalizi at March 21, 2012 11:49 | permanent link
Homework 7: A little theory, a little methodology, a little data analysis: these keep growing young statisticians healthily balanced.
assignment, n90_pol.csv data
Posted by crshalizi at March 20, 2012 10:31 | permanent link
Simulation: implementing the story encoded in the model, step by step, to produce something data-like. Stochastic models have random components and so require some random steps. Stochastic models specified through conditional distributions are simulated by chaining together random variables. How to generate random variables with specified distributions. Simulation shows us what a model predicts (expectations, higher moments, correlations, regression functions, sampling distributions); analytical probability calculations are short-cuts for exhaustive simulation. Simulation lets us check aspects of the model: does the data look like typical simulation output? if we repeat our exploratory analysis on the simulation output, do we get the same results? Simulation-based estimation: the method of simulated moments.
Reading: Notes, chapter 16; R
Posted by crshalizi at March 20, 2012 10:30 | permanent link
My paper with Aaron Clauset and Mark Newman on power laws has just passed 1000 citations on Google Scholar, slightly ahead of schedule. (Actually, the accuracy of Aaron's prediction is a little creepy.)
I am spending the day reading over my student Daniel McDonald's dissertation draft. The calendar tells me that I was in the middle of writing up my own dissertation in mid-March 2001. But this is impossible, since I could swear that was just a few months ago at most, not eleven years.
Most significant of all, one of my questions has been answered by Guillaume the adaptationist goat.
Posted by crshalizi at March 15, 2012 11:00 | permanent link
The desirability of estimating not just conditional means, variances, etc., but whole distribution functions. Parametric maximum likelihood is a solution, if the parametric model is right. Histograms and empirical cumulative distribution functions are non-parametric ways of estimating the distribution: do they work? The Glivenko-Cantelli law on the convergence of empirical distribution functions, a.k.a. "the fundamental theorem of statistics". More on histograms: they converge on the right density, if bins keep shrinking but the number of samples per bin keeps growing. Kernel density estimation and its properties: convergence on the true density if the bandwidth shrinks at the right rate; superior performance to histograms; the curse of dimensionality again. An example with cross-country economic data. Kernels for discrete variables. Estimating conditional densities; another example with the OECD data. Some issues with likelihood, maximum likelihood, and non-parametric estimation.
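For concreteness, a minimal sketch of a Gaussian kernel density estimate in one dimension. The function name, the toy data and the fixed bandwidth are assumptions for illustration; picking the bandwidth well, which is where the convergence results come in, is not attempted here.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density of `data` with Gaussian kernels."""
    n = len(data)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Average of Gaussian bumps of width `bandwidth`, one centered on each data point.
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

    return density

f = gaussian_kde([1.2, 1.9, 2.1, 2.4, 3.7], bandwidth=0.5)

# Sanity check: the estimate should integrate to (very nearly) one.
step = 0.01
grid = [i * step for i in range(-300, 901)]  # from -3 to 9
area = sum(f(x) for x in grid) * step
print(area)
```

Shrinking `bandwidth` makes the estimate spikier around the data points, and growing it smooths everything towards a single bump; the bias-variance trade-off lives in that one number.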
Reading: Notes, chapter 15
Posted by crshalizi at March 08, 2012 10:30 | permanent link
Reminders about multivariate distributions. The multivariate Gaussian distribution: definition, relation to the univariate or scalar Gaussian distribution; effect of linear transformations on the parameters; plotting probability density contours in two dimensions; using eigenvalues and eigenvectors to understand the geometry of multivariate Gaussians; conditional distributions in multivariate Gaussians and linear regression; computational aspects, specifically in R. General methods for estimating parametric distributional models in arbitrary dimensions: moment-matching and maximum likelihood; asymptotics of maximum likelihood; bootstrapping; model comparison by cross-validation and by likelihood ratio tests; goodness of fit by the random projection trick.
Reading: Notes, chapter 14
Posted by crshalizi at March 06, 2012 09:25 | permanent link
Building a weather forecaster for Snoqualmie Falls, Wash., with logistic regression. Exploratory examination of the data. Predicting wet or dry days from the amount of precipitation the previous day. First logistic regression model. Finding predicted probabilities and confidence intervals for them. Comparison to spline smoothing and a generalized additive model. Model comparison test detects significant mis-specification. Re-specifying the model: dry days are special. The second logistic regression model and its comparison to the data. Checking the calibration of the second model.
Reading: Notes, second half of chapter 13; Faraway, chapters 6 and 7
Posted by crshalizi at March 01, 2012 10:30 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Enigmas of Chance; Minds, Brains, and Neurons; The Commonwealth of Letters; Commit a Social Science
Posted by crshalizi at February 29, 2012 23:59 | permanent link
In which we practice our art upon the condition formerly known as juvenile diabetes.
Posted by crshalizi at February 28, 2012 10:31 | permanent link
Iteratively re-weighted least squares for logistic regression re-examined: coping with nonlinear transformations and model-dependent heteroskedasticity. The common pattern of generalized linear models and IRWLS. Binomial and Poisson regression. The extension to generalized additive models.
Reading: Notes, first half of chapter 13; Faraway, section 3.1, chapter 6
Posted by crshalizi at February 28, 2012 10:30 | permanent link
Modeling conditional probabilities; using regression to model probabilities; transforming probabilities to work better with regression; the logistic regression model; maximum likelihood; numerical maximum likelihood by Newton's method and by iteratively re-weighted least squares; comparing logistic regression to logistic-additive models.
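A minimal sketch of numerical maximum likelihood for logistic regression with one predictor, using Newton's method with the 2x2 linear system solved by hand. The function name, the fixed number of iterations and the toy data are assumptions for illustration; there is no step-size control, and no protection against perfectly separable data.

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Fit P(y=1|x) = 1/(1+exp(-(b0 + b1*x))) by Newton's method.

    xs: list of numbers, ys: list of 0/1 responses. Returns (b0, b1).
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of X'WX, the negative Hessian
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)    # model-implied variance, i.e. the IRLS weight
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton update: beta <- beta + (X'WX)^{-1} * gradient
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
    return b0, b1

# Toy data: 1 success in 4 at x = 0, 3 successes in 4 at x = 1.
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
print(fit_logistic(xs, ys))
```

With a binary predictor the model is saturated, so the fitted probabilities must match the observed proportions (1/4 and 3/4), which gives a handy check on the output. Each Newton step here is the same calculation as one round of weighted least squares with weights `w`, which is why the two methods in the lecture coincide for this model.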
Reading: Notes, chapter 12; Faraway, chapter 2 (skipping sections 2.11 and 2.12)
Posted by crshalizi at February 23, 2012 10:30 | permanent link
In which we examine the fate of the organized working class, by way of review for the midterm.
Assignment, strikes.csv data set
Posted by crshalizi at February 21, 2012 10:30 | permanent link
Attention conservation notice: Intellectuals gathering in Berkeley to argue about "knowledge" and "revolution".
This looks like fun, and if I didn't have conflicting obligations I'd definitely be there.
From Data to Knowledge: Machine-Learning with Real-time & Streaming Applications
May 7-11 2012
On the Campus of the University of California, Berkeley
We are experiencing a revolution in the capacity to quickly collect and transport large amounts of data. Not only has this revolution changed the means by which we store and access this data, but has also caused a fundamental transformation in the methods and algorithms that we use to extract knowledge from data. In scientific fields as diverse as climatology, medical science, astrophysics, particle physics, computer vision, and computational finance, massive streaming data sets have sparked innovation in methodologies for knowledge discovery in data streams. Cutting-edge methodology for streaming data has come from a number of diverse directions, from on-line learning, randomized linear algebra and approximate methods, to distributed optimization methodology for cloud computing, to multi-class classification problems in the presence of noisy and spurious data.
This conference will bring together researchers from applied mathematics and several diverse scientific fields to discuss the current state of the art and open research questions in streaming data and real-time machine learning. The conference will be domain driven, with talks focusing on well-defined areas of application and describing the techniques and algorithms necessary to address the current and future challenges in the field.
Sessions will be accessible to a broad audience and will have a single track format with additional rooms for breakout sessions and posters. There will be no formal conference proceedings, but conference applicants are encouraged to submit an abstract and present a talk and/or poster.
See the conference page for submission details, schedules, etc.
Via conference organizer and CMU alumnus Joey Richards.
Posted by crshalizi at February 19, 2012 12:44 | permanent link
Attention conservation notice: Only of interest if you (1) like hearing people talk about statistics and machine learning, and (2) will be in Pittsburgh next week.
I have been remiss about advertising upcoming talks.
As always, the talks are free and open to the public.
(You see why I have trouble keeping up with these.)
Posted by crshalizi at February 19, 2012 12:30 | permanent link
In which extinct charismatic megafauna give us an excuse to practice basic programming, bootstrapping, and specification testing.
Posted by crshalizi at February 15, 2012 14:15 | permanent link
Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using bootstrapping from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
Reading: Notes, chapter 10
Posted by crshalizi at February 15, 2012 14:10 | permanent link
A change to the lecture schedule, by popular demand!
R programs are built around functions: pieces of code that take inputs or arguments, do calculations on them, and give back outputs or return values. The most basic use of a function is to encapsulate something we've done in the terminal, so we can repeat it, or make it more flexible. To assure ourselves that the function does what we want it to do, we subject it to sanity-checks, or "write tests". To make functions more flexible, we use control structures, so that the calculation done, and not just the result, depends on the argument. R functions can call other functions; this lets us break complex problems into simpler steps, passing partial results between functions. Programs inevitably have bugs: debugging is the cycle of figuring out what the bug is, finding where it is in your code, and fixing it. Good programming habits make debugging easier, as do some tricks. Avoiding iteration. Re-writing code to avoid mistakes and confusion, to be clearer, and to be more flexible.
Reading: Notes, chapter 9
Optional reading: Slides from 36-350, introduction to statistical computing, especially through lecture 15.
R for in-class demos (based around the previous problem set)
Posted by crshalizi at February 15, 2012 14:05 | permanent link
Attention conservation notice: Academics with blogs quibbling about obscure corners of applied statistics.
Lurkers in e-mail point me to this pushback against the general pushback against power laws, and ask me to comment. It might be a mistake to do so, but I'm feeling under the weather and so splenetic, so I will.
In our paper, we looked at 24 quantities which people claimed showed power law distributions. Of these, there were seven cases where we could flat-out reject a power law, without even having to consider an alternative, because the departures of the actual distribution from even the best-fitting power law were much too large to be explained away as fluctuations. (One of the wonderful things about a stochastic model is that it tells you how big its own errors should be.) In contrast, there was only one data set where we could rule out the log-normal distribution.
In some of those cases, you can patch things up, sort of, by replacing a pure power law with a power-law with an exponential cut-off. That is, rather than the probability density being proportional to \( x^{-a} \), it's proportional to \( x^{-a} e^{-x/L} \). (Either way, I am only talking about the probability density in the "right tail", i.e., for \( x \) above some \( x_{\min} \).) This gives the infamous straight-ish patch on a log-log plot, for values of \( x \) much smaller than \( L \), but otherwise it has substantially different properties. In ten of the twenty-four cases we looked at, the only way to save the idea of a power-law at all is to include this exponential cut-off. But that exponentially-shrinking factor is precisely what squelches the WTF, X IS ELEVENTY TIMES LARGER THAN EVER! THE BIG ONE IS IN OUR BASE KILLING OUR DOODZ!!!!1!! mega-events. There were ten more cases where we judged the support for power laws as "moderate", meaning "the power law is a good fit but that there are other plausible alternatives as well" (pardon the self-quotation). Again, those alternatives, like log-normals and stretched exponentials, give very different tail-behavior, with not so much OMG DOOM.
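To see why the cut-off form looks straight only well below the cut-off scale, one can compute the local slope on a log-log plot. For a density proportional to a power of x (exponent -a) times exp(-x/L), that slope works out to -a - x/L. The sketch below checks this numerically; the parameter values (a = 2.5, L = 1000) and function names are made up for illustration and are not fitted to any of the data sets.

```python
import math

def log_density_cutoff(x, a, L):
    """Log of the unnormalized density x**(-a) * exp(-x / L)."""
    return -a * math.log(x) - x / L

def local_slope(x, a, L, eps=1e-6):
    """Numerical slope of log-density against log(x), i.e. the slope seen on a log-log plot."""
    lx = math.log(x)
    upper = log_density_cutoff(math.exp(lx + eps), a, L)
    lower = log_density_cutoff(math.exp(lx - eps), a, L)
    return (upper - lower) / (2 * eps)

a, L = 2.5, 1000.0
for x in (1.0, 10.0, 100.0, 1000.0, 10000.0):
    # Compare the numerical slope with the closed form -a - x/L.
    print(x, local_slope(x, a, L), -a - x / L)
```

For x far below L the slope is essentially -a, the straight-ish patch; by the time x is ten times L it has steepened by a further ten units, which is what starves the extreme tail of mega-events.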
We found exactly one case where the statistical evidence for the power-law was "good", meaning that "the power law is a good fit and that none of the alternatives considered is plausible", which was Zipf's law of word frequency distributions. We were of course aware that when people claim there are power laws, they usually only mean that the tail follows a power law. This is why all these comparisons were about how well the different distributions fit the tail, excluding the body of the data. We even selected where "the tail" begins to maximize the fit to a power law for each case. Even so, there was just this one case where the data compellingly support a power law tail.
(All of this — the meaning of "with cut-off", the meaning of our categorizations, the fact that we only compare the tails, etc. — is clear enough from our paper, if you actually read the text. Or even just the tables and their captions.)
I bring up the OMG DOOM because some people, Hanson very much included, like to extrapolate from supposed power laws for various Bad Things to scenarios where THE BIG ONE kills off most of humanity. But, at least with the data we found, the magnitudes of forest fires, solar flares, earthquakes and wars were all better fit by log-normals, by stretched exponentials and by cut-off power laws than by power laws. For fires, flares and quakes, the differences are large enough that they clearly fall into the "with cut-off only" category. The differences in fits for the war-death data are smaller, as (mercifully) is the sample size, so we put it in the "moderate" support category. If you had some compelling other reason to insist on a power law rather than (e.g.) a log-normal there, the data wouldn't slap you down, but they wouldn't back you up either.
Now, I relish the schadenfreude-laden flavors of a mega-disaster scenario as much as the next misanthropic, science-fiction-loving geek, especially when it's paired with some "The fools! Can't they follow simple math?" on the side. Truly, I do. But squeezing that savory, juicy DOOM out of (for instance) the distribution of solar flares relies on the shape of the tail, i.e., whether it's a pure power law or not. The weak support, in the data, for such power laws means you don't really have empirical evidence for your scenarios, and in some cases what evidence there is tells against them. It's a free country, so you can go on telling those stories, but don't pretend that they owe more to confronting hard truths than to literary traditions.
Posted by crshalizi at February 15, 2012 14:00 | permanent link
Attention conservation notice: 1500 word pedagogical-statistical rant, with sarcasm, mathematical symbols, computer code, and a morally dubious affectation of detachment from the human suffering behind the numbers. Plus the pictures are boring.
Does anyone know when the correlation coefficient is useful, as opposed to when it is used? If so, why not tell us?
— Tukey (1954: 721)
If you have taken any sort of statistics class at all, you have probably been exposed to the idea of the "proportion of variance explained" by a regression, conventionally written \(R^2\). This has two definitions, which happen to coincide for linear models fit by least squares. The first is to take the correlation between the model's predictions and the actual values (\(R\)) and square it (\(R^2\)), getting a number which is guaranteed to be between 0 and 1. You get 1 only when the predictions are perfectly correlated with reality, and 0 when there is no linear relationship between them. The other definition is the ratio of the variance of the predictions to the variance of the actual values. It is this latter which leads to the notion that \(R^2\) is the proportion of variance explained by the model.
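As a quick check on the claim that the two definitions coincide for least squares, here is a minimal sketch for a one-variable linear fit. The helper names are mine, and the simulated data in the usage note are arbitrary.

```python
import random
import statistics

def ols_fit(xs, ys):
    """Least-squares intercept and slope for a one-variable linear model."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def squared_correlation(us, vs):
    """Square of the sample correlation coefficient between two sequences."""
    mu, mv = statistics.fmean(us), statistics.fmean(vs)
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    return suv * suv / (suu * svv)

def r_squared_two_ways(xs, ys):
    """Return (squared correlation of predictions with y, Var[predictions]/Var[y])."""
    intercept, slope = ols_fit(xs, ys)
    preds = [intercept + slope * x for x in xs]
    by_correlation = squared_correlation(preds, ys)
    by_variance = statistics.pvariance(preds) / statistics.pvariance(ys)
    return by_correlation, by_variance

if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    print(r_squared_two_ways(xs, ys))
```

The two numbers agree up to floating-point error here because the fit is by least squares. For other smoothers they need not agree, which is why the spline fit to the Chicago data gets two slightly different values.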
The use of the word "explained" here is quite unsupported and often actively misleading. Let me go over some examples to indicate why.
Start by supposing that a linear model is true, with slope b and noise variance s: \[ Y = a + bX + \epsilon, \qquad \mathrm{Var}[\epsilon] = s \]
Well, no. The answer depends on the variance of X, which it will be convenient to call v. The variance of the predictions is \(b^2 v\), but the variance of Y is larger, \(b^2 v + s\). The ratio is \[ R^2 = \frac{b^2 v}{b^2v + s} \] (You can check that this is also the squared correlation between the predictions and Y.) As v shrinks, this tends to 0/s = 0. As v grows, this tends to 1. The relationship between X and Y doesn't change, the accuracy and precision with which Y can be predicted from X do not change, but \(R^2\) can wander all through its range, just depending on how dispersed X is.
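A small simulation makes the same point numerically. This is only a sketch: it assumes a Gaussian X and Gaussian noise (the algebra above needs neither), takes the intercept a to be 0 by default, and uses the true parameters for prediction, as the Good Fairy would have it. The function names are mine.

```python
import random
import statistics

def theoretical_r2(b, v, s):
    """R^2 for a true linear model with slope b, Var[X] = v, noise variance s."""
    return b * b * v / (b * b * v + s)

def simulated_r2(b, v, s, n=20000, a=0.0, seed=0):
    """Variance of the true-parameter predictions over variance of simulated Y."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, v ** 0.5) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0.0, s ** 0.5) for x in xs]
    preds = [a + b * x for x in xs]  # predictions use the TRUE a and b
    return statistics.pvariance(preds) / statistics.pvariance(ys)

if __name__ == "__main__":
    b, s = 2.0, 1.0  # same relationship and same noise level throughout
    for v in (0.01, 0.1, 1.0, 10.0, 100.0):
        print(f"Var[X] = {v:6.2f}   theory = {theoretical_r2(b, v, s):.3f}"
              f"   simulation = {simulated_r2(b, v, s):.3f}")
```

Nothing about the slope or the noise changes between rows of the output; only the spread of X does.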
Now, you say, this is a silly algebraic curiosity. Never mind the Good Fairy of Statistical Modeling handing us the correct parameters, let's talk about something gritty and real, like death in Chicago.
Figure: Number of deaths each day in Chicago, 1 January 1987--31 December 2000, from all causes except accidents. (See below for link to code.)
I can relate deaths to time in any number of ways; the next figure shows what I get when I use a smoothing spline (and use cross-validation to pick how much smoothing to do). The statistical model is
Figure: As before, but with the addition of a smoothing spline.
The root-mean-square error of the smoothing spline is just above 12 deaths/day. The \(R^2\) of the fit is either 0.35 (squared correlation between predicted and actual deaths) or 0.33 (variance of predicted deaths over variance of actual deaths). It seems absurd, however, to say that the date explains how many people died in Chicago on a given day, or even the variation from day to day. The closest I can come up with to an example of someone making such a claim would be an astrologer, and even one of them would work in some patter about the planets and their influences. (Numerologists, maybe? I dunno.)
Worse is to follow. The same data set which gives me these values for Chicago includes other variables, such as the concentration of various atmospheric pollutants and temperature. I can fit an additive model, which tries to tease out the separate relationships between each of those variables and deaths in Chicago, without presuming a particular functional form for each relationship. In particular I can try the model
The \(R^2\) of this model is 0.27. Is this "variance explained"? Well, it's at least not incomprehensible to talk about changes in temperature or pollution explaining changes in mortality. In fact, adding this model's predictions to the simple spline's, we see that most of what the spline predicted from the date is predictable from pollution and temperature:
Figure: Black dots: actual death counts. Red curve: spline smoothing on the date alone. Blue lines: predictions from the temperature-and-pollution model.
We could, in fact, try to include the date in this larger model:
Despite the lack of visual drama, putting a smooth function of time back into the model increases \(R^2\), from 0.27 to 0.30. Formally, the date enters into the model in exactly the same way as particulate pollution. But, again, only a fortune teller (an unusually numerate fortune teller, perhaps a subscriber to the Journal of Evidence-Based Haruspicy) would say that the date explains, or helps explain, 3% of the variance.
I hope that by this point you will at least hesitate to think or talk about \(R^2\) as "the proportion of variance explained". (I will not insist on your never talking that way, because you might need to speak to the deluded in terms they understand.) How then should you think about it? I would suggest: the proportion of variance retained, or just kept, by the predictions. Linear regression is a smoothing method. (It just smoothes everything on to a line, or more generally a hyperplane.) It's hard for any smoother to give fitted values which have more variance than the variable it is smoothing. \(R^2\) is merely the fraction of the target's variance which is not smoothed away.
This of course raises the question of why you'd care about this number at all. If prediction is your goal, then it would seem much more natural to look at mean squared error. (Or really root mean squared error, so it's in the same units as the variable predicted.) Or mean absolute error. Or median absolute error. Or a genuine loss function. If on the other hand you want to get some function right, then your question is really about mis-specification, and/or confidence sets of functions, and not about whether your smoother is following every last wiggle of the data at all. If you want an explanation, the fact that there is a peak in deaths every year of about the same height, but the predictions fall short of it, suggests that this model is missing something. The fact that the data shows something awful happened in 1995 and the model has nothing adequate to say about it suggests that whatever's missing is very important.
Code for reproducing the figures and analyses in R. (I make this public, despite the similarity of this exercise to the last problem-set in advanced data analysis, because (i) it's not exactly the same, (ii) the homework is due in ten hours, (iii) none of my students would dream of copying this and turning it in as their own, and (iv) I borrowed the example from Simon Wood's Generalized Additive Models.)
Manual trackback: Bob O'Hara; Siris
Posted by crshalizi at February 13, 2012 23:54 | permanent link
1. I'd like to say that you have no idea how long I have waited to read something like this piece by Michael Stumpf and Mason Porter in one of the glossy journals. But that would be a lie, because if you've been reading this for any length of time, you know that the answer is, long enough to be very tiresome about it. If the referees, and still more the editors, at those journals can be persuaded to pay attention, we will be on track for my mid-2007 hope that "in five to ten years even science journalists and editors of Wired will begin to get the message." (I never really had any hopes for Wired.)
2. You can imagine how my heart sank to see that Krugman had a post titled "The Power (Law) of Twitter" — and my relief to see that he's not actually saying that the distribution of followers is a power law. It is however interesting that the distribution is so close to a log-normal.
3. My ex-boss and mentor Melanie Mitchell has a blog, and promises a substantive series of posts on power laws and scaling. In the meanwhile, go read her book.
Update, 15 February: see later post.
Manual trackback: Brendan O'Connor
(Nos. 1 and 2 via too many to list.)
Posted by crshalizi at February 13, 2012 20:40 | permanent link
The "curse of dimensionality" limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimensionality. Parametric models do not have this problem, but have bias and do not let us discover anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example, where each input variable has a "partial response function", which add together to get the total regression function; the partial response functions are unconstrained. This generalizes linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. Examples in R using the California house-price data. Conclusion: there are no statistical reasons to prefer linear models to additive models, hardly any scientific reasons, and increasingly few computational ones; the continued thoughtless use of linear regression is a scandal.
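The iterative fitting procedure described above (backfitting) can be sketched in a few lines. This is a toy version under my own assumptions: a Gaussian-kernel smoother with a fixed, hand-picked bandwidth stands in for whatever one-dimensional smoother you prefer, the number of sweeps is fixed rather than checked for convergence, and all names are hypothetical. It is not the code from the lecture notes.

```python
import math
import random

def kernel_smooth(xs, ys, bandwidth):
    """Gaussian-kernel (Nadaraya-Watson) smoother, evaluated at each x in xs."""
    fitted = []
    for x0 in xs:
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
        total = sum(weights)  # always > 0: the point at x0 itself has weight 1
        fitted.append(sum(w * y for w, y in zip(weights, ys)) / total)
    return fitted

def backfit(columns, ys, bandwidth=0.3, sweeps=10):
    """Fit y ~ alpha + sum_j f_j(x_j) by backfitting.

    columns is a list of lists, one per input variable.  Returns
    (alpha, partials), where partials[j][i] estimates f_j at observation i.
    """
    n, p = len(ys), len(columns)
    alpha = sum(ys) / n
    partials = [[0.0] * n for _ in range(p)]  # initial guess: all zero
    for _ in range(sweeps):
        for j in range(p):
            # Partial residuals: what is left after the intercept and
            # every OTHER partial response function have had their say.
            resid = [ys[i] - alpha
                     - sum(partials[k][i] for k in range(p) if k != j)
                     for i in range(n)]
            smoothed = kernel_smooth(columns[j], resid, bandwidth)
            center = sum(smoothed) / n
            # Center each partial response so the intercept is identifiable.
            partials[j] = [value - center for value in smoothed]
    return alpha, partials

if __name__ == "__main__":
    rng = random.Random(0)
    n = 120
    x1 = [rng.uniform(-2, 2) for _ in range(n)]
    x2 = [rng.uniform(-2, 2) for _ in range(n)]
    y = [math.sin(u) + v * v + rng.gauss(0, 0.1) for u, v in zip(x1, x2)]
    alpha, partials = backfit([x1, x2], y)
    fitted = [alpha + partials[0][i] + partials[1][i] for i in range(n)]
    mse = sum((f - t) ** 2 for f, t in zip(fitted, y)) / n
    print(f"intercept = {alpha:.3f}, in-sample MSE = {mse:.3f}")
```

Each pass does only one-dimensional smoothing, which is how the method sidesteps the curse of dimensionality. In practice the bandwidth would be chosen by cross-validation rather than fixed in advance.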
Reading: Notes, chapter 8; Faraway, chapter 12
Posted by crshalizi at February 09, 2012 10:30 | permanent link
In which spline regression becomes a matter of life and death in Chicago.
Posted by crshalizi at February 07, 2012 10:31 | permanent link
Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression.
Reading: Notes, chapter 7; Faraway, section 11.2.
Posted by crshalizi at February 07, 2012 10:30 | permanent link
Weighted least squares estimates. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
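For the case where the variance function is known, weighted least squares for a single input reduces to a few weighted sums. The sketch below is mine, not the code from the notes; the variance function in the usage example is invented for illustration.

```python
def weighted_least_squares(xs, ys, weights):
    """Intercept and slope minimizing sum_i w_i * (y_i - a - b * x_i)**2."""
    total = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / total  # weighted mean of x
    my = sum(w * y for w, y in zip(weights, ys)) / total  # weighted mean of y
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def inverse_variance_weights(xs, variance_function):
    """Weights proportional to 1 / Var[noise | X = x], the usual choice."""
    return [1.0 / variance_function(x) for x in xs]

if __name__ == "__main__":
    import random
    rng = random.Random(0)
    def noise_variance(x):
        # Hypothetical variance function: noise grows with x.
        return (0.2 + x) ** 2
    xs = [rng.uniform(0, 5) for _ in range(500)]
    ys = [3.0 + 2.0 * x + rng.gauss(0.0, noise_variance(x) ** 0.5) for x in xs]
    weights = inverse_variance_weights(xs, noise_variance)
    print("WLS:", weighted_least_squares(xs, ys, weights))
    print("OLS:", weighted_least_squares(xs, ys, [1.0] * len(xs)))
```

Setting all the weights equal recovers ordinary least squares, so the same function serves for both. When the variance function is not known, it would have to be estimated from the residuals first, which this sketch does not attempt.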
Reading: Notes, chapter 6; Faraway, section 11.3.
Posted by crshalizi at February 02, 2012 10:30 | permanent link
Attention conservation notice: I have no taste.
The Idiad
Shall I write a poem about you
And your epic struggle against stupidity?
Feh. But if the brain is a city
I too have rooms in the swampy part, surrounded by crocodiles.
The monarch butterflies sail down from the Canadian Rockies
To overwinter in Pacific Grove, pair off and fly away;
They bruise me. I get crankier.
If you are coming down through the narrows of the Saugatuck
Please text me beforehand,
And I will come out to meet you
As far as Palookaville.
Posted by crshalizi at January 31, 2012 23:59 | permanent link
In which we consider evolutionary trends in body size, aided by regression modeling and the bootstrap.
Posted by crshalizi at January 31, 2012 19:11 | permanent link
Quantifying uncertainty by looking at sampling distributions. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping. Non-parametric bootstrapping. Many examples. When does the bootstrap fail?
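The non-parametric version of the bootstrap principle fits in a dozen lines: treat the empirical distribution as the good estimate of the truth, resample from it with replacement, and look at how the estimate varies. The sketch below is a generic illustration with made-up data and my own function names, not the in-class code. The percentile interval is only one of several ways to turn replicates into a confidence interval.

```python
import random
import statistics

def bootstrap_replicates(data, estimator, n_boot=1000, rng=None):
    """Re-compute `estimator` on n_boot resamples of `data`, drawn with replacement."""
    rng = rng or random.Random()
    n = len(data)
    return [estimator([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]

def percentile_interval(replicates, level=0.95):
    """Crude percentile confidence interval from bootstrap replicates."""
    ordered = sorted(replicates)
    lower = int((1 - level) / 2 * len(ordered))
    upper = max(lower, int((1 + level) / 2 * len(ordered)) - 1)
    return ordered[lower], ordered[upper]

if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(10.0, 2.0) for _ in range(200)]  # made-up sample
    reps = bootstrap_replicates(data, statistics.median, rng=random.Random(7))
    print("estimate:", statistics.median(data))
    print("bootstrap standard error:", statistics.stdev(reps))
    print("95% percentile interval:", percentile_interval(reps))
```

A parametric bootstrap would keep the same loop but replace the resampling line with draws from a fitted model.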
Reading: Notes, chapter 5 (R for figures and examples; pareto.R; wealth.dat); R for in-class examples
Posted by crshalizi at January 31, 2012 19:10 | permanent link
Fortunately, however, the methods of those who can handle big data are neither grotesque nor incomprehensible, and we will hear about them on Monday.
As always, the talk is free and open to the public.
Posted by crshalizi at January 31, 2012 19:00 | permanent link
Attention conservation notice: Only of interest if you (1) care about combinatorial stochastic processes and their statistical applications, and (2) will be in Pittsburgh on Wednesday afternoon.
It is only in very special weeks, when we have been very good, that we get two seminars.
As always, the talk is free and open to the public.
Posted by crshalizi at January 31, 2012 18:45 | permanent link
Attention conservation notice: Associate editor at a non-profit scientific journal endorses a call for boycotting a for-profit scientific journal publisher.
I have for years been refusing to publish in or referee for journals published by Elsevier; pretty much all of the commercial journal publishers are bad deals (note 1), but they are outrageously worse than most. Since learning that Elsevier had a business line in putting out publications designed to look like peer-reviewed journals, and calling themselves journals, but actually full of paid-for BS, I have had a form letter I use for declining requests to referee, letting editors know about this, and inviting them to switch to a publisher which doesn't deliberately seek to profit by corrupting the process of scientific communication.
I am thus extremely happy to learn from Michael Nielsen that Tim Gowers is organizing a general boycott of Elsevier, asking people to pledge not to contribute to its journals, referee for them, or do editorial work for them. You can sign up here, and I strongly encourage you to do so. There are fields where Elsevier does publish the leading journals, and where this sort of boycott would be rather more personally costly than it is in statistics, but there is precedent for fixing that. Once again, I strongly encourage readers in academia to join this.
(To head off the inevitable mis-understandings, I am not, today, calling for getting rid of journals as we know them. I am saying that Elsevier is ripping us off outrageously, that conventional journals can be published without ripping us off, and so we should not help Elsevier to rip us off.)
Disclaimer, added 29 January: As I should have thought went without saying, I am speaking purely for myself here, and not with any kind of institutional voice. In particular, I am not speaking for the Annals of Applied Statistics, or for the IMS, which publishes it. (Though if the IMS asked its members to join in boycotting Elsevier, I would be very happy.)
1: Let's review how scientific journals work, shall we? Scientists are not paid by journals to write papers: we do that as volunteer work, or more exactly, part of the money we get for teaching and from research grants is supposed to pay for us to write papers. (We all have day-jobs.) Journals are edited by scientists, who volunteer for this and get nothing from the publisher. (New editors get recruited by old editors.) Editors ask other scientists to referee the submissions; the referees are volunteers, and get nothing from the publisher (or editor). Accepted papers are typeset by the authors, who usually have to provide "camera-ready" copy. The journal publisher typically provides an electronic system for keeping track of submitted manuscripts and the refereeing process. Some of them also provide a minimal amount of copy-editing on accepted papers, of dubious value. Finally, the publisher actually prints the journal, and runs the server distributing the electronic version of the paper, which is how, in this day and age, most scientists read it. While the publisher's contribution isn't nothing, it's also completely out of proportion to the fees they charge, let alone economically efficient pricing. The whole thing would grind to a halt without the work done by scientists, as authors, editors and referees. That work, to repeat, is paid for either by our students or by our grants, not by the publisher. This makes the whole system of for-profit journal publication economically insane, a check on the dissemination of knowledge which does nothing to encourage its creation. Elsevier is simply one of the worst of these parasites.
Manual trackback: Cosmic Variance; Open A Vein; AgroEcoPeople; QED Insight
Posted by crshalizi at January 28, 2012 11:15 | permanent link
Attention conservation notice: Only of interest if you (1) care about covariance matrices and (2) will be in Pittsburgh on Monday.
Since so much of multivariate statistics depends on patterns of correlation among variables, it is a bit awkward to have to admit that in lots of practical contexts, correlation matrices are just not very stable, and can change quite drastically. (Some people pay a lot to rediscover this.) It turns out that there are more constructive responses to this situation than throwing up one's hands and saying "that sucks", and on Monday a friend of the department and general brilliant-type-person will be kind enough to tell us about them:
As always, the talk is free and open to the public.
Posted by crshalizi at January 27, 2012 14:25 | permanent link
The constructive alternative to complaining about linear regression is non-parametric regression. There are many ways to do this, but we will focus on the conceptually simplest one, which is smoothing; especially kernel smoothing. All smoothers involve local averaging of the training data. The bias-variance trade-off tells us that there is an optimal amount of smoothing, which depends both on how rough the true regression curve is, and on how much data we have; we should smooth less as we get more information about the true curve. Knowing the truly optimal amount of smoothing is impossible, but we can use cross-validation to select a good degree of smoothing, and adapt to the unknown roughness of the true curve. Detailed examples. Analysis of how quickly kernel regression converges on the truth. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.
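Kernel smoothing with a cross-validated bandwidth can be sketched as follows. This is an illustration under my own assumptions (one input, Gaussian kernel, leave-one-out cross-validation over a small grid of candidate bandwidths, simulated data), with hypothetical function names, and is not a substitute for the np package mentioned in the readings.

```python
import math
import random

def kernel_predict(x0, xs, ys, bandwidth, skip=None):
    """Nadaraya-Watson estimate at x0: a kernel-weighted local average of ys.

    If `skip` is an index, that observation is left out (for cross-validation).
    """
    numerator = denominator = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        numerator += w * y
        denominator += w
    if denominator == 0.0:
        # All weights underflowed (tiny bandwidth): fall back to the global mean.
        return sum(ys) / len(ys)
    return numerator / denominator

def loo_cv_error(xs, ys, bandwidth):
    """Leave-one-out mean squared prediction error for a given bandwidth."""
    errors = [(ys[i] - kernel_predict(xs[i], xs, ys, bandwidth, skip=i)) ** 2
              for i in range(len(xs))]
    return sum(errors) / len(errors)

def choose_bandwidth(xs, ys, candidates):
    """Pick the candidate bandwidth with the smallest leave-one-out error."""
    return min(candidates, key=lambda h: loo_cv_error(xs, ys, h))

if __name__ == "__main__":
    rng = random.Random(3)
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(150)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    candidates = [0.01, 0.05, 0.2, 0.5, 1.0, 3.0, 10.0]
    for h in candidates:
        print(f"h = {h:5.2f}   LOO-CV MSE = {loo_cv_error(xs, ys, h):.4f}")
    print("selected:", choose_bandwidth(xs, ys, candidates))
```

Very small bandwidths chase the noise and very large ones flatten the curve into its mean. The cross-validation error should bottom out somewhere in between, without our having to know how rough the true curve is.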
Readings: Notes, chapter 4 (R); Faraway, section 11.1
Optional readings: Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
Posted by crshalizi at January 26, 2012 10:30 | permanent link
In which we try to discern whether poor countries grow faster.
Posted by crshalizi at January 26, 2012 09:30 | permanent link
Goals of statistical analysis: summaries, prediction, scientific inference. Evaluating predictions: in-sample error, generalization error; over-fitting. Cross-validation for estimating generalization error and for model selection. Justifying model-based inferences; Luther and Süleyman.
Reading: Notes, chapter 3 (R for examples and figures).
Posted by crshalizi at January 24, 2012 10:30 | permanent link
Multiple linear regression: general formula for the optimal linear predictor. Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why \(R^2\) is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable problems). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What "controlled for in a linear regression" really means.
Reading: Notes, chapter 2 (R for examples and figures); Faraway, chapter 1 (continued).
Posted by crshalizi at January 24, 2012 10:15 | permanent link
Attention conservation notice: A silly idea about gamifying credit cards, which would be evil if it worked.
To make a profit in an otherwise competitive industry, it helps if you can impose switching costs on your customers, making them either pay to stop doing business with you, or give up something of value to them. There are whole books about this, written by respected economists (note 1).
This is why credit card companies are happy to offer rewards for use: accumulating points on a card, which would not move with you if you got a new card and transferred the balance, is an attempt to create switching costs. Unfortunately, from the point of view of the banks, people will redeem their points from time to time, so some money must be spent on the rewards. The ideal would be points which people would value but which would never cost the bank anything.
Item: Computer games are, deliberately, addictive. Social games are especially addictive.
Accordingly, if I were an evil and unscrupulous credit card company (but I repeat myself), I would create an online game, where people could get points either from playing the game, or from spending money with my credit card. For legal reasons, I think it would probably be best to allow the game to technically be open to everyone, but with a registration fee which is, naturally, waived for card-holders. Of course, the game software would be set up to announce on Facebook (etc.) whenever the player/debtor leveled up. I would also be tempted to award double points for fees, and triple for interest charges, but one could experiment with this. If they close their credit card account, they have to start the game over from the beginning.
The fact that online acquaintances can't tell whether the debtor is advancing through spending or through game-play helps keep the reward points worth having. It's true that the credit card company has to pay for the game's design (a one-time start-up cost) and the game servers, but these are fairly cheap, and the bank never has to cash out points in actual dollars or goods. The debtors themselves do all the work of investing the points with meaning and value. They impose the switching costs on themselves.
My plan is sheer elegance in its simplicity, and I will be speaking to an attorney about a business method patent first thing Monday.
1: Much can be learned about our benevolent new-media overlords from the fact that this book carries a blurb from Jeff Bezos of Amazon, and that Varian now works for Google.
Posted by crshalizi at January 22, 2012 10:15 | permanent link
Attention conservation notice: An academic paper you've never heard of, about a distressing subject, had bad statistics and is generally foolish.
Because my so-called friends like to torment me, several of them made sure that I knew a remarkably idiotic paper about power laws was making the rounds, promoted by the ignorant and credulous, with assistance from the credulous and ignorant, supported by capitalist tools:
Let's see if we can't stop this before it gets too far, shall we? The serial killer in question is one Andrei Chikatilo, and that Wikipedia article gives the dates of death of his victims, which seems to have been Simkin and Roychowdhury's data source as well. Several of these are known only imprecisely, so I made guesses within the known ranges; the results don't seem to be very sensitive to the guesses. Simkin and Roychowdhury plotted the distribution of days between killings in a binned histogram on a logarithmic scale; as we've explained elsewhere, this is a bad idea, which destroys information to no good purpose, and a better display shows the (upper or complementary) cumulative distribution function (note 1), which looks like so:
When I fit a power law to this by maximum likelihood, I get an exponent of 1.4, like Simkin and Roychowdhury; that looks like this:
On the other hand, when I fit a log-normal (because Gauss is not mocked), we get this:
After that figure, a formal statistical test is almost superfluous, but let's do it anyway, because why just trust our eyes when we can calculate? The data are better fit by the log-normal than by the power-law (the data are \(e^{10.41}\), or about 33 thousand, times more likely under the former than the latter), but that could happen via mere chance fluctuations, even when the power law is right. Vuong's model comparison test lets us quantify that probability, and tells us a power-law would produce data which seems to fit a log-normal this well no more than 0.4 percent (note 2) of the time. Not only does the log-normal distribution fit better than the power-law, the difference is so big that it would be absurd to try to explain it away as bad luck. In absolute terms, we can find the probability of getting as big a deviation between the fitted power law and the observed distribution through sampling fluctuations, and it's about 0.03 percent (note 2b). [R code for figures, estimates and test, including data.]
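The arithmetic of the model comparison is simple enough to sketch. The function below is my own minimal version of the Vuong statistic for two non-nested models, taking per-observation log-likelihoods as input. It uses the normal approximation and the 1/n standard deviation, and it is not the R code linked above. The four numbers in the usage example are invented purely to exercise the function.

```python
import math
import statistics

def vuong_test(loglik_a, loglik_b):
    """Vuong's test from per-observation log-likelihoods under models A and B.

    Returns (statistic, lower_tail_p).  The statistic is
    sqrt(n) * mean(d) / sd(d), with d_i = loglik_a[i] - loglik_b[i];
    large negative values favor model B.  lower_tail_p is the standard
    normal probability of a value at or below the statistic.
    """
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    n = len(diffs)
    statistic = math.sqrt(n) * statistics.fmean(diffs) / statistics.pstdev(diffs)
    lower_tail_p = statistics.NormalDist().cdf(statistic)
    return statistic, lower_tail_p

if __name__ == "__main__":
    # Invented per-observation log-likelihoods, NOT the Chikatilo data.
    power_law = [-1.0, -2.0, -1.5, -2.5]
    log_normal = [-0.5, -1.0, -1.5, -1.0]
    statistic, p_value = vuong_test(power_law, log_normal)
    print(f"statistic = {statistic:.2f}, one-sided p = {p_value:.4f}")
    # Total log-likelihood difference converted to a likelihood ratio:
    print(f"likelihood ratio = {math.exp(sum(log_normal) - sum(power_law)):.1f}")
```

With the real data one would pass in the fitted power law's and log-normal's log-densities evaluated at each observed interval. The sum of the differences gives the log of the overall likelihood ratio.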
Since Simkin and Roychowdhury's model produces a power law, and these data, whatever else one might say about them, are not power-law distributed, I will refrain from discussing all the ways in which it is a bad model. I will re-iterate that it is an idiotic paper — which is different from saying that Simkin and Roychowdhury are idiots; they are not and have done interesting work on, e.g., estimating how often references are copied from bibliographies without being read by tracking citation errors (note 4). But the idiocy in this paper goes beyond statistical incompetence. The model used here was originally proposed for the time intervals between epileptic fits. The authors realize that
[i]t may seem unreasonable to use the same model to describe an epileptic and a serial killer. However, Lombroso [5] long ago pointed out a link between epilepsy and criminality.
That would be the 19th-century pseudo-scientist (note 3) Cesare Lombroso, who also thought he could identify criminals from the shape of their skulls; for "pointed out", read "made up". Like I said: idiocy.
As for the general issues about power laws and their abuse, say something once, why say it again?
Update 9 pm that day: Added the goodness-of-fit test (text
before note 2b, plus that note), updated code, added PNG versions of figures,
added attention conservation notice.
21 January: typo fixes (missing pronoun, mis-placed decimal point), added
bootstrap confidence interval for exponent, updated code accordingly.
Manual trackback: Hacker News (do I really need to link to this?), Naked Capitalism (?!); Mathbabe; Wolfgang Beirl; Ars Mathematica (yes, I am that predictable); Improbable Research (I am not worthy)
1: This is often called the "survival function", but that seems inappropriate here.
2: On average, the log-likelihood of each observation was 0.20 higher under the log-normal than under the power law, and the standard deviation of the log likelihood ratio over the samples was only 0.54. The test statistic thus comes out to -2.68, and the one-sided p-value to 0.36%.
2b: Use a Kolmogorov-Smirnov test. Since the power law has a parameter estimated from data (namely, the exponent), we can't just plug in to the usual tables for a K-S test, but we can find a p-value by simulating the power law (as in my paper with Aaron and Mark), and when I do that, with a hundred thousand replications, the p-value is about \(3 \times 10^{-4}\).
3: There are in fact subtle, not to say profound, issues in the sociology and philosophy of science here: was Lombroso always a pseudo-scientist, because his investigations never came up to any acceptable standard of reliable inquiry? Or just because they didn't come up to the standards of inquiry prevalent at the time he wrote? Or did Lombroso become a pseudo-scientist, when enough members of enough intellectual communities woke up from the pleasure of having their prejudices about the lower orders echoed to realize that he was full of it? However that may be, this paper has the dubious privilege of being the first time I have ever seen Lombroso cited as an authority rather than a specimen.
4: Actually, for several years my bibliography data base had the wrong page numbers for one of my own papers, due to a typo, so their method would flag some of my subsequent works as written by someone who had cited that paper without reading it, which I assure you was not the case. But the idea seems reasonable in general.
Posted by crshalizi at January 17, 2012 20:23 | permanent link
In which we practice the art of linear regression upon the California real-estate market, by way of warming up for harder tasks.
(Yes, the data set is now about as old as my students, but last week in
Austin I was too busy drinking on 6th street having lofty
conversations about the future of statistics to update the file with
the UScensus2000
package.)
Posted by crshalizi at January 17, 2012 10:31 | permanent link
Statistics is the science which studies methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
Readings: Notes, chapter 1; Faraway, chapter 1, through page 17.
Posted by crshalizi at January 17, 2012 10:30 | permanent link
If you sent me e-mail at my @stat.cmu.edu address in the last few days, I haven't gotten it, and may never get it. The address firstinitiallastname at cmu dot edu now points somewhere where I can read.
Posted by crshalizi at January 07, 2012 20:40 | permanent link
I'll be speaking at UT-Austin next week, through the kindness of the division of statistics and scientific computation:
This will of course be based on my paper with Alessandro, but since I understand some non-statisticians may sneak in, I'll try to be more comprehensible and less technical.
Since this will be my first time in Austin (indeed my first time in Texas), and I have (for a wonder) absolutely no obligations on the 12th, suggestions on what I should see or do would be appreciated.
Posted by crshalizi at January 06, 2012 14:15 | permanent link
It's that time again:
Fuller details on the class homepage, including a detailed (but subject to change) list of topics, and links to the compiled course notes. I'll post updates here to the notes for specific lectures and assignments, like last time.
This is the same course I taught last spring, only grown from sixty-odd students to (currently) ninety-three (from 12 different majors!). The smart thing for me to do would probably be to change nothing (I haven't gotten to re-teach a class since 2009), but I felt the urge to re-organize the material and squeeze in a few more topics.
The biggest change I am making is introducing some quality-control sampling. The course is too big for me to look over much of the students' work, and even then, that gives me little sense of whether the assignments are really probing what they know (much less helping them learn). So I will be randomly selecting six students every week, to come to my office and spend 10--15 minutes each explaining the assignment to me and answering live questions about it. Even allowing for students being randomly selected multiple times*, I hope this will give me a reasonable cross-section of how well the assignments are working, and how well the grading tracks that. But it's an experiment and we'll see how it goes.
* (exercise for the student): Find the probability distribution of the number of times any given student gets selected. Assume 93 students, with 6 students selected per week, and 14 weeks. (Also assume no one drops the class.) Find the distribution of the total number of distinct students who ever get selected.
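One way to check an answer to the exercise numerically is sketched below. It assumes that each week's six students are a simple random sample without replacement from all 93, drawn independently from week to week; the post does not spell out the sampling scheme, so that is an assumption, and the function names are made up for illustration. Under it, a given student's selection count is binomial with 14 trials and success probability 6/93, and the number of distinct students seen so far evolves as a Markov chain with hypergeometric increments.

```python
from math import comb

N_STUDENTS, PER_WEEK, WEEKS = 93, 6, 14

def selection_count_pmf(n=N_STUDENTS, m=PER_WEEK, weeks=WEEKS):
    """P(a given student is selected k times), k = 0..weeks.
    Each week the student is picked with probability m/n, independently
    across weeks (assumed), so the count is Binomial(weeks, m/n)."""
    p = m / n
    return [comb(weeks, k) * p**k * (1 - p)**(weeks - k)
            for k in range(weeks + 1)]

def distinct_selected_pmf(n=N_STUDENTS, m=PER_WEEK, weeks=WEEKS):
    """P(exactly d distinct students have been selected after all weeks),
    d = 0..n, computed by stepping a Markov chain week by week."""
    dist = [0.0] * (n + 1)
    dist[0] = 1.0                      # before week 1, nobody has been seen
    total = comb(n, m)                 # number of possible weekly samples
    for _ in range(weeks):
        new = [0.0] * (n + 1)
        for d, prob in enumerate(dist):
            if prob == 0.0:
                continue
            for j in range(m + 1):
                # j new faces from the n - d unseen, m - j repeats from the d seen
                ways = comb(n - d, j) * comb(d, m - j)
                if ways:
                    new[d + j] += prob * ways / total
        dist = new
    return dist

counts = selection_count_pmf()
distinct = distinct_selected_pmf()
print("P(never selected):", counts[0])
print("P(selected more than once):", sum(counts[2:]))
print("Expected distinct students:", sum(d * p for d, p in enumerate(distinct)))
```

As a sanity check, the expected number of distinct students from the Markov-chain calculation should agree with the linearity-of-expectation answer, 93 times one minus the probability that a given student is never selected.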
Posted by crshalizi at January 03, 2012 23:00 | permanent link
Attention conservation notice: Navel-gazing.
Paper manuscripts completed: 12
Papers accepted: 2 [i, ii], one from last year
Papers rejected: 10 (fools! I'll show you all!)
Papers rejected with a comment from the editor that no one should take the paper I was responding to, published in the same glossy high-impact journal, "literally": 1
Papers in refereeing limbo: 4
Papers in progress: I won't look in that directory and you can't make me
Grant proposals submitted: 3
Grant proposals rejected: 4 (two from last year)
Grant proposals in refereeing limbo: 1
Grant proposals in progress for next year: 3
Talks given and conferences attended: 20, in 14 cities
Manuscripts refereed: 46, for 18 different journals and conferences
Manuscripts waiting for me to referee: 7
Manuscripts for which I was the responsible associate editor at Annals of Applied Statistics: 10
Book proposals reviewed: 3
Classes taught: 2
New classes taught: 2
Summer school classes taught: 1
New summer school classes taught: 1
Pages of new course material written: about 350
Students who are now ABD: 1
Students who are not just ABD but on the job market: 1
Letters of recommendation written: 8 (with about 100 separate destinations)
Promotion packets submitted: 1 (for promotion to associate professor, but without tenure)
Promotion cases still working through the system: 1
Book reviews published on dead trees: 2 [i, ii]
Non-book-reviews published on dead trees: 1
Weblog posts: 157
Substantive weblog posts: 54, counting algal growths
Books acquired: 298
E-book readers gratefully received: 1
Books driven by my mother from her house to Pittsburgh: about 800
Books begun: 254
Books finished: 204 (of which 34 on said e-book reader)
Books given up: 16
Books sold: 133
Books donated: 113
Book manuscripts completed: 0
Wisdom teeth removed: 4
Unwise teeth removed: 1
Major life transitions: 0
Posted by crshalizi at January 01, 2012 12:00 | permanent link