May 21, 2012

If Peer Review Did Not Exist, We Would Have to Invent Something Very Like It to Serve Highly Similar Ends

Attention conservation notice: 1400 words on a friend's proposal to do away with peer review, written many weeks ago when there was actually some debate about this.

Larry is writing about peer review (again), this time to advocate "A World Without Referees". Every scientist, of course, has day-dreamed about this, in a first-let's-kill-all-the-lawyers way, but Larry is serious, so let's treat this seriously. I'm not going to summarize his argument; it's short and you can and should go read it yourself.

I think it helps, when thinking about this, to separate two functions peer-reviewed journals and conferences have traditionally served. One is spreading claims (dissemination), and the other is letting readers know about claims worthy of their attention (certification).

Arxiv, or something like it, can take over dissemination handily. Making copies of papers is now very cheap and very fast, so we no longer have to be choosy about which ones we disseminate. In physics, this use of Arxiv is just as well-established as Larry says. In fact, one reason Arxiv was able to establish itself so rapidly and thoroughly among physicists was that they already had a well-entrenched culture of circulating preprints long before journal publication. What Arxiv did was make this public and universally accessible.

But physicists still rely on journals for certification. People pay more attention to papers which come out in Physical Review Letters, or even just Physical Review E, than ones which are only on Arxiv. "Could it make it past peer review?" is used by many people as a filter to weed out the stuff which is too wrong or too unimportant to bother with. This doesn't work so well for those directly concerned with a particular research topic, but if something is only peripherally of interest, it makes a lot of sense.

Even within a specialized research community, consisting entirely of experts who can evaluate new contributions on their own, there is a rankling inefficiency to the world without referees. Larry talks about spending a minute or two looking at new stats. papers on Arxiv every day. But everyone filtering Arxiv for themselves is going to get harder and harder as more potentially-relevant stuff gets put on it. I'm interested in information theory, so I've long looked at cs.IT, and it's become notably more time-consuming as that community has embraced the Arxiv. Yet within any given epistemic community, lots of people are going to be applying very similar filters. So the world-without-referees has an increasing amount of work being done by individuals, but a lot of that work is redundant. Efficiency, the division of labor, points to having a few people put their time into filtering, and the rest of us relying on it, even when in principle we could do the filtering ourselves. To be fair, of course, we should probably take this job in turns...

So: if all papers get put on Arxiv, filtering becomes a big job, so efficiency pushes us towards having only some members of the research community do the filtering for the rest. We have re-invented something very much like peer review, purely so that our lives are not completely consumed by evaluating new papers, and we can actually get some work done.

Larry's proposal for a world without referees also doesn't seem to take into account the needs of researchers to rely on findings in fields in which they are not experts, and so can't act as their own filters. (Or they could if they put in a few years in something else first.) If I need some result from neuroscience, or for that matter from topology, I do not have the time to spend becoming a neuroscientist or topologist, and it is an immense benefit to have institutions I can trust to tell me "these claims about cortical columns, or locally compact Hausdorff spaces, are at least not crazy". This is also a kind of filtering, and there is the same push, based on the division of labor, to rely on only some neuroscientists or topologists to do the filtering for outsiders (or all of them only some of the time), and again we have re-created something very much like refereeing.

So: some form or forms of filtering is inevitable, and the forces pushing for a division of labor in filtering are very strong. I don't know of any reason to think that the current, historically-evolved peer review system is the best way of organizing this cognitive triage, but we're not going to avoid having some such system, nor should we want to. Different ways of organizing the work of filtering will have different costs and benefits, but we should be talking about those and those trade-offs, not hoping that we can just wish the problem away now that making copies is cheap [1]. It's not at all obvious, for instance, that attention-filtering for the internal benefit of members of a research community should be done in the same way as reliability-filtering for outsiders. But, to repeat, we are going to have filters and they are almost certainly going to involve a division of labor.

Lenin, supposedly, said that "small production engenders capitalism and the bourgeoisie daily, hourly, spontaneously and on a mass scale" (Nove, The Economics of Feasible Socialism Revisited, p. 46). Whether he was right about the bourgeoisie or not, the rate of production of the scientific literature, the similarity of interests and standards within a community, and the need to rely on other fields' findings are all going to engender refereeing-like institutions, "daily, hourly, spontaneously and on a mass scale". I don't think Larry would go to the same lengths to get rid of referees that Lenin went to get rid of the bourgeoisie, but in any case the truly progressive course is not to suppress the old system by force, but to provide a superior alternative.

Speaking personally, I am attracted to a scenario we might call "peer review among consenting adults". Let anyone put anything on Arxiv (modulo the usual crank-screen). But then let others create filtered versions, applying such standards of topic, rigor, applicability, writing quality, etc., as they please --- and be explicit about what those standards are. These can be layered as deep as their audience can support. Presumably the later filters would be intended for those further from active research in the area, and so would be less tolerant of false alarms, and more tolerant of missing possible discoveries, than the filters for those close to the work. But this could be an area for experiment, and for seeing what people actually find useful. This is, I take it, more or less what Paul Ginsparg proposes, and it has a lot to recommend it. Every contribution is available if anyone wants to read it, but no one is compelled to try to filter the whole flow of the scholarly literature unaided, and human intelligence can still be used to amplify interesting signals, or even to improve papers.

Attractive as I find this idea, I am not saying it is historically inevitable, or even the best possible way of ordering these matters. The main point is that peer review does some very important jobs for the community of inquirers (whether or not it evolved to do them), and that if we want to get rid of it, it would be a good idea to have something else ready to do those jobs.

[1]: For instance, many people have suggested that referees should have to take responsibility, in some way, for their reports, so that those who do sloppy or ignorant or merely-partisan work will be at least shamed. There is genuinely a lot to be said for this. But it does run into the conflicting demand that science should not be a respecter of persons --- if Grand Poo-Bah X writes a crappy paper, people should be able to call X on it, without fear of retribution or considering the (inevitable) internal politics of the discipline and the job-market. I do not know if there is a way to reconcile these, but that's one of the kinds of trade-offs we have to consider as we try to re-design this institution.

Learned Folly; Kith and Kin; The Collective Use and Evolution of Concepts

Posted by crshalizi at May 21, 2012 02:00 | permanent link

May 03, 2012

Ten Years of Monster Raving Egomania and Utter Batshit Insanity

Sometimes, all you can do is quote verbatim* from your inbox:

Date: Tue, 17 Apr 2012 09:31:57 -0400
From: Stephen Wolfram
To: Cosma Shalizi
Subject: 10-year followup on "A New Kind of Science"

Next month it'll be 10 years since I published "A New Kind of Science"
... and I'm planning to take stock of the decade of commentary, feedback and
follow-on work about the book that's appeared.

My archives show that you wrote an early review of the book:
http://www.cscs.umich.edu/~crshalizi/reviews/wolfram/

At the time reviews like yours appeared, most of the modern web apparatus
for response and public discussion had not yet developed.  But now it has,
and there seems to be considerable interest in the community in me using
that venue to give my responses and comments to early reviews.

I'm writing to ask if there's more you'd like to add before I embark on my
analysis in the next week or so.

I'd like to take this opportunity to thank you for the work you put into
writing a review of my book.  I know it was a challenge to review a book of
its size, especially quickly.  I plan to read all reviews with forbearance,
and hope that---especially leavened by the passage of a decade---useful
intellectual points can be derived from discussing them.

If you don't have anything to add to your early review, it'd be very helpful
to know that as soon as possible.

Thanks in advance for your help.

-- Stephen Wolfram

P.S. Nowadays you can find the whole book online at
http://www.wolframscience.com/nksonline/toc.html  If you'd like a new
physical copy, just let me know and I can have it shipped...


I wrote my review in 2002 (though I didn't put it out until 2005). The idea that complex patterns can arise from simple rules was already old then, and has only become more commonplace since. A lot of interesting, substantive, specific science has been done on that theme in the ensuing decade. To this effort, neither Wolfram nor his book has contributed anything of any note. The one respect in which I was overly pessimistic is that I have not, in fact, had to spend much time "de-programming students [who] read A New Kind of Science before knowing any better" --- but I get a rather different class of students these days than I did in 2002.

Otherwise, and for the record, I do indeed still stand behind the review.

Manual trackback: Hacker News; Wolfgang; Andrew Gelman

*: I removed our e-mail addresses, because no one deserves spam.

Self-Centered; Complexity; Psychoceramica

Posted by crshalizi at May 03, 2012 23:10 | permanent link

May 02, 2012

Installing pcalg

Attention conservation notice: Boring details about getting finicky statistical software to work; or, please read the friendly manual.

Some of my students are finding it difficult to install the R package pcalg; I share these instructions in case others are also in difficulty.

  1. For representing graphs, pcalg relies on two packages called RBGL and graph. These are not available on CRAN, but rather are on the other R software repository, BioConductor. To install them, follow the instructions at those links; to summarize, run this:
    source("http://bioconductor.org/biocLite.R")
    biocLite("RBGL")
    (Since RBGL depends on graph, this should automatically also install graph; if not, run biocLite("graph"), then biocLite("RBGL").)
  2. Now install pcalg from CRAN, along with the packages it depends on. You will get a warning about not having the Rgraphviz package. However, you will be able to load pcalg and run it. You should be able to step through the example labeled "Using Gaussian Data" at the end of help(pc), though it will not produce any plots.

    You can still extract the graph by hand from the fitted models returned by functions like pc --- if one of those objects is fit, then fit@graph@edgeL is a list of lists, where each node has its own list, naming the other nodes it has arrows to (not from). If you are doing this for the final in ADA, you don't actually need anything beyond this to do the assignment, as explained in question A1a.

  3. Rgraphviz is what pcalg relies on for drawing pictures of causal graphs. Its installation is somewhat tricky, so there is a README file, which you should read.
    The key point is that Rgraphviz itself relies on a non-R suite of programs called graphviz. You will want to install these. Go to graphviz.org, and download and install the software. (If you use a Mac, the standard download also includes Graphviz.app, which is a nice visual interface to the actual graph-drawing functions, and what I use for drawing the DAGs in the lecture notes.)
  4. You have to make sure that your operating system will let other software (like R) call on graphviz. The way to do this is to add the directory (or folder) where you installed graphviz to the list of places your computer recognizes as containing executable programs --- the system's "command path". The README for installing Rgraphviz explains what you have to add to the path. (If you are a Windows user and do not know how to alter the command path, read this.)
  5. If you have R open, close it. (If you do not, it will probably not know about the new software you've just gotten the system to recognize.) Re-open R, and install Rgraphviz. The basic installation command is just
    source("http://bioconductor.org/biocLite.R")
    biocLite("Rgraphviz")
    The README for Rgraphviz gives some checks which you should be able to run if everything is working; try them.
  6. You should now be able to generate pictures of DAGs with pc and the other functions in pcalg; try stepping through all the examples at the end of help(pc).
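If Rgraphviz will not cooperate, the by-hand extraction mentioned in step 2 can be sketched roughly as follows. This is only an illustration: edge_table is a hypothetical helper (it is not part of pcalg), and it assumes that each node's entry in the edge list has a component called edges holding the integer positions of the nodes it points to. Run str(fit@graph@edgeL) on your own fitted object to confirm the layout before relying on it. The mock list stands in for a real fit so that the code runs without pcalg installed.

```r
# Hypothetical helper: turn an edgeL-style list into a table of arrows.
# ASSUMPTION: each node's entry has a component `edges` containing the
# integer positions (within `nodes`) of the nodes it has arrows to.
edge_table <- function(edgeL, nodes = names(edgeL)) {
  # Number of outgoing arrows per node
  n_out <- vapply(edgeL, function(e) length(e$edges), integer(1))
  # Repeat each parent once per outgoing arrow
  from <- rep(nodes, times = n_out)
  # Look up the names of the nodes being pointed to
  to <- nodes[unlist(lapply(edgeL, function(e) e$edges), use.names = FALSE)]
  data.frame(from = from, to = to, stringsAsFactors = FALSE)
}

# Hand-built stand-in for fit@graph@edgeL, encoding A -> C, B -> C, C -> D
mock_edgeL <- list(A = list(edges = 3L),
                   B = list(edges = 3L),
                   C = list(edges = 4L),
                   D = list(edges = integer(0)))
edge_table(mock_edgeL)
```

With a real fit, the call would presumably be edge_table(fit@graph@edgeL); if the list turns out to be unnamed, pass the vector of node names as the second argument.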

When I installed pcalg on my laptop two weeks ago, it was painless, because (1) I already had graphviz, and (2) I knew about BioConductor. (In fact, the R graphical interface on the Mac will switch between installing packages from CRAN and from BioConductor.) To check these instructions, I just now deleted all the packages from my computer and re-installed them, and everything worked; elapsed time, ten minutes, mostly downloading.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at May 02, 2012 21:30 | permanent link

May 01, 2012

Final Exam (Advanced Data Analysis from an Elementary Point of View)

In which we are devoted to two problems of political economy, viz., strikes, and macroeconomic forecasting.

Assignment; macro.csv

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at May 01, 2012 10:31 | permanent link

Time Series I (Advanced Data Analysis from an Elementary Point of View)

What time series are. Properties: autocorrelation or serial correlation; other notions of serial dependence; strong and weak stationarity. The correlation time and the world's simplest ergodic theorem; effective sample size. The meaning of ergodicity: a single increasingly long time series becomes representative of the whole process. Conditional probability estimates; Markov models; the meaning of the Markov property. Autoregressive models, especially additive autoregressions; conditional variance estimates. Bootstrapping time series. Trends and de-trending.

Reading: Notes, chapter 26; R for examples; gdp-pc.csv

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at May 01, 2012 10:30 | permanent link

April 30, 2012

Books to Read While the Algae Grow in Your Fur, April 2012

Attention conservation notice: I have no taste.

Susan Whitfield, Life along the Silk Road
Not-quite-historical fiction: life stories of sundry Silk Road characters — merchants, monks, soldiers, artists, ordinary widows — distributed from Samarkand to Chang-an, and from 700 to 900 AD. These are all more or less composites of actual people, glimpsed from the archaeological record, and especially through the manuscripts preserved at Dunhuang and saved/stolen by Aurel Stein. (In fact the whole book owes a great deal to Stein, with a lot of input from Beckwith's The Tibetan Empire in Central Asia.) The lack of references makes it hard to know how much is stitched together from sources and how much is Whitfield's invention, but at the very least it's well-told.
Nathan Long, Jane Carver of Waar
Mind candy. This is at once a parody of, and homage to, Barsoom. Unlike Burroughs, Long's book can be enjoyed after the Golden Age of Science Fiction (i.e., by those over the age of sixteen): his characters are all at least two-dimensional (Jane herself is an engaging narrator, though definitely at the Hill end of the Moby-Hill spectrum), his style is decent, and the plot is actually interesting. I think it would be enjoyable even if you hadn't dosed up on planetary romances as a kid.
James S. A. Corey (i.e., Daniel Abraham and Ty Franck), Leviathan Wakes
Mind candy. Space opera, confined to the solar system a few centuries hence. This has gotten a lot of favorable attention, but I found it merely OK; perhaps I'd have enjoyed it more if my expectations had been lower. It's split between two plot lines, with two point-of-view characters; I enjoyed (but wasn't blown away by) one of them, but found the other both over-predictable and irritating. It does some things well (a reasonably-sized solar system! minimal handwavium! a non-grim-meathook-future future! some decent characterization!), but it never really managed to grab me. It's definitely nowhere near as good as, say, McAuley's The Quiet War, to name a recent and thematically-similar book. The sequel will be out soon, and seems like it will be continuing along the better of the two narrative threads here, so I might pick it up, but I won't rush to do so.
Spoiler-laden griping: Bar bs gur gjb cybg yvarf vf n uneq-obvyrq vairfgvtngvba, pbzcyrgr jvgu na nypbubyvp zvqqyr-ntrq qrgrpgvir, pbeehcg vagevthrf, naq n zlfgrevbhf qnzr jub gur qrgrpgvir snyyf va ybir jvgu. V qba'g yvxr gur uneq-obvyrq traer, orpnhfr, juvyr V nz irel fragvzragny, vgf cnegvphyne pbzovangvba bs fragvzragnyvgl naq plavpvfz vf bss-chggvat. Fb onfvpnyyl V jnagrq gb fxvc nyy gur puncgref sebz Zvyyre'f cbvag bs ivrj, naq whfg sbyybj gubfr jvgu Ubyqra naq uvf perj. Yrff crefbanyyl (v.r., nf n engvbanyvmngvba), abve cerfhccbfrf fhpu n irel cnegvphyne, uvfgbevpnyyl-yvzvgrq phygheny frggvat gung frrvat vg fvzcyl qhzcrq vagb jung fubhyq or n enqvpnyyl arj xvaq bs fbpvrgl jnf wneevat. (Rirelguvat ba Prerf jbexf yvxr Puvpntb pvepn 1940 orpnhfr ubj ryfr?).
Ba n qvssrerag cynar nygbtrgure, Cebgbtra'f ernfbaf sbe jnagvat gb gel bhg gur nyvra ivehf/znpuvar ba gur jubyr cbchyngvba bs Rebf ner jrnx. Vs gur cbvag bs gur znpuvar vf gb gnxr bire rkvfgvat ovbznff naq erfuncr vg nppbeqvat gb fbzr cebtenz, vg jbhyq frrz vasvavgryl rnfvre gb tvir vg hzcgrra gbaf bs lrnfg gb cynl jvgu, guna gb fcraq lrnef bepurfgengvat gur gnxr-bire bs n pbybal jvgu bire n zvyyvba crbcyr, gb fnl abguvat bs gur erqhprq cbffvovyvgl sbe oybj-onpx, frphevgl oernpurf, rgp. Ab qbhog gurl'q jnag gb gel vg ba crbcyr riraghnyyl, ohg fgnegvat gurer, jvgu ab pbageby bire rssrpgf, vf whfg onq rkcrevzragny qrfvta. Cyhf "tvir gur napvrag fhcre-nqinaprq nyvra jne znpuvar pbageby bire na nfgrebvq" qbrf abg fbhaq yvxr n cyna juvpu jbhyq qrirybc gb n fbpvbcngu'f nqinagntr. (Gurl jbhyqa'g pner nobhg gur qnzntr gb bguref, ohg gurzfryirf?)
In conclusion, bring me back my cane and then get off my lawn, you're trampling the lilies.
Matthew Johnson, Fall from Earth [buying: publisher, audio]
Mind candy. Scheme-laden first-contact space opera with a social setting I can only call "The Ming Dynasty IN SPAAAAACE". Good enough that I will keep a look out for more from Johnson.
It's a small thing, but Johnson shows no appreciation of the energy required to move food from planet to planet, which makes his "equitable marketing system" a complete non-starter. (But he shares this flaw with Cherryh's deservedly-admired Downbelow Station.) If, however, the magistracy wants to make sure that no world can become self-sufficient, the way to do it would be to restrict their manufacturing, since any colony would be dependent for survival on a complex industrial infrastructure.
Bernard Williams, Truth and Truthfulness: An Essay in Genealogy
Shorter Williams: "Say what you mean. Bear witness. Iterate." (The late John M. Ford, in a different context.)
Slightly longer: You can get a decent sense of what the book is about from the publishers, so I'll comment without much exposition.
When Williams talks about a "genealogy" of some idea or practice, he means an account of why, if it did not exist, we would have to invent it. Specifically, he spins a state-of-nature story about how if, in the state of nature, human beings did not have an idea of truth, but nonetheless were social and rational animals, and so dependent on a division of epistemic labor, they would have to form one, and two "virtues of truthfulness", namely "sincerity" (Ford's "say what you mean") and "accuracy" (Ford's "bear witness") to make it effective. This is not intended as history or pre-history (Williams: "the state of nature is not the Pleistocene"), but it is a bit mysterious to me how then it is supposed to explain our notions of truth, truthfulness, sincerity, accuracy, etc., much less explain them "non-reductively". Perhaps — this is suggested by his section on "Shameful Origins" — it is just supposed to make us feel better about having them, by convincing us that we could have acquired such ideas in a way which doesn't discredit them. (We are not suckers.)
It may sound odd to describe "accuracy" as a virtue, but being accurate --- bearing good witness --- means things like checking tendencies to leap to conclusions, choosing appropriate methods of inquiry, taking pains to secure all the relevant facts (Williams is especially good on the notion of "facts"), etc. Williams is indeed eloquent on how the virtues of accuracy are one of the things which have made the pursuit of science a source of human values, especially in circumstances where honesty otherwise was hard.
As this last suggests, culture lets us articulate the raw virtues of sincerity and accuracy into incredibly elaborate and interlocking complexes of attitudes and practices (Ford's "iterate"). From the inside, these have, or at least seem to have, intrinsic as well as instrumental value, and indeed they would not work at all if their value was just instrumental. I confess that I do not fully follow Williams's attempt to try to explain when or why or how the virtues of truth become "intrinsic values". It seems to be something like: people find these values compelling, in a way which they would not if they saw them just as handy tools for achieving selfish ends; this in turn makes these values successful commitment devices [1]. Williams seems to me to equivocate as to whether these virtues really do have such intrinsic value, but on balance I am just as happy that he strayed no deeper into the swamp of meta-ethics, and wisely turned back to the sounder terrain of looking at certain episodes in the articulation of these virtues. The two main case-studies he gives are contrasts of Thucydides and Herodotus on history, and of Rousseau and Diderot on authenticity and the self. Both of these really have a wider, philosophical import, and as such they would both have been stronger for a more comparative, cross-cultural perspective --- not in the service of the small virtue of courtesy (Williams has mercifully few "what you mean 'we', white man?" moments), but rather in the service of the great virtue of accuracy [2].
But I see that I am descending into my usual quibbling. This is a profoundly thoughtful and profoundly learned book, which has interesting things to say about some of the deepest and most humanly-important problems in philosophy, and says them elegantly. Go read.
[1] I cannot help but be reminded of William James:
Now, why do the various animals do what seem to us such strange things, in the presence of such outlandish stimuli? Why does the hen, for example, submit herself to the tedium of incubating such a fearfully uninteresting set of objects as a nestful of eggs, unless she have some sort of a prophetic inkling of the result? The only answer is ad hominem. We can only interpret the instincts of brutes by what we know of instincts in ourselves. Why do men always lie down, when they can, on soft beds rather than on hard floors? Why do they sit round the stove on a cold day? Why, in a room, do they place themselves, ninety-nine times out of a hundred, with their faces towards its middle rather than to the wall? Why do they prefer saddle of mutton and champagne to hard-tack and ditch-water? Why does the maiden interest the youth so that everything about her seems more important and significant than anything else in the world? Nothing more can be said than that these are human ways, and that every creature likes its own ways, and takes to the following them as a matter of course. Science may come and consider these ways, and find that most of them are useful. But it is not for the sake of their utility that they are followed, but because at the moment of following them we feel that that is the only appropriate and natural thing to do. Not one man in a billion, when taking his dinner, ever thinks of utility. He eats because the food tastes good and makes him want more. If you ask him why he should want to eat more of what tastes like that, instead of revering you as a philosopher he will probably laugh at you for a fool. The connection between the savory sensation and the act it awakens is for him absolute and selbstverständlich, an "a priori synthesis" of the most perfect sort, needing no proof but its own evidence.
It takes, in short, what Berkeley calls a mind debauched by learning to carry the process of making the natural seem strange, so far as to ask for the why of any instinctive human act. To the metaphysician alone can such questions occur as: Why do we smile, when pleased, and not scowl? Why are we unable to talk to a crowd as we talk to a single friend? Why does a particular maiden turn our wits so upside-down? The common man can only say, "Of course we smile, of course our heart palpitates at the sight of the crowd, of course we love the maiden, that beautiful soul clad in that perfect form, so palpably and flagrantly made from all eternity to be loved!"
And so, probably, does each animal feel about the particular things it tends to do in presence of particular objects. They, too, are a priori syntheses. To the lion it is the lioness which is made to be loved; to the bear, the she-bear. To the broody hen the notion would probably seem monstrous that there should be a creature in the world to whom a nestful of eggs was not the utterly fascinating and precious and never-to-be-too-much-sat-upon object which it is to her.
Thus we may be sure that, however mysterious some animals' instincts may appear to us, our instincts will appear no less mysterious to them. And we may conclude that, to the animal which obeys it, every impulse and every step of every instinct shines with its own sufficient light, and seems at the moment the only eternally right and proper thing to do. It is done for its own sake exclusively. What voluptuous thrill may not shake a fly, when she at last discovers the one particular leaf, or carrion, or bit of dung, that out of all the world can stimulate her ovipositor to its discharge? Does not the discharge then seem to her the only fitting thing? And need she care or know anything about the future maggot and its food?
More soberly, or at least with fewer hens and maggots, this is highly reminiscent of Robert Frank's Passion within Reason, which I do not believe Williams mentions.
[2] Williams claims, quite plausibly, that Thucydides had different ideas about historical explanation and historical evidence than did Herodotus --- ones which are both stricter about what counts as acceptable history, and which are supported by compelling rationales even within the older framework. He also claims, more sketchily, that Herodotus was immersed in a culture which was still partly oral and partly literate, while Thucydides was not. If all this was right, should not the same contrast show up in the historical traditions of China, the Islamic world, etc.? Why does such a tradition not seem to be indigenous to India? (Cf., on all this, Brown's History, Hierarchy, and Human Nature.) Western Europe, after the fall of the western Roman Empire, never lost literacy, but it certainly didn't produce histories like Thucydides's for many centuries: why, on Williams's account, not? (Actually, outside of Italy, did western Europe ever produce such histories before the fall of the empire?) If there are important distinctions between these cases, such that Williams's account applies only in the special circumstances of the Aegean around 500--300 BC, what are those circumstances? --- Let me add that it was Williams who made all these considerations relevant, not me.
J. C. W. Rayner and D. J. Best, Smooth Tests of Goodness of Fit
Suppose a random variable \( Y \) is confined to the unit interval \( [0,1] \), and we want to test whether it is uniformly distributed. One way to do this would be to construct alternative distributions which are in some sense smooth departures from uniformity, with densities \( g(y;\theta) = e^{\sum_{j=1}^{d}{\theta_j h_j(y)}}/z(\theta) \), where it is convenient to choose the \( h_j \) functions to be an orthonormal basis --- the cosine basis, say, or the Legendre polynomials. (That is, they are orthonormal in \( L_2 \), the space of square-integrable functions on the unit interval.) Uniformity is then the special case \( \theta = 0 \), and we can test it against the alternative that \( \theta \neq 0 \) by the usual devices of a likelihood-ratio test, a score test, etc., which will all, under the null hypothesis, have an asymptotic \( \chi^2_d \) distribution. This is Neyman's original smooth test, which seems to have originated from the problem of how to combine p-values from independent experiments, which should all be uniformly distributed under the null hypothesis. One nice feature of this test is that if we reject the null, we immediately have an alternative, namely our maximum likelihood estimate of \( \theta \), for what the actual distribution is --- it tells us not just that the null model is wrong, but how, and what a better one would be like.
The real power of this comes from the following observation. If \( X \) is distributed according to some continuous CDF \( F \), with density \( f \), then \( Y=F(X) \) is uniformly distributed on \( [0,1] \). The smooth alternatives for \( Y \) translate into smooth alternatives for \( X \), with densities \( g_X(x;\theta) = f(x) e^{\sum_{j=1}^{d}{\theta_j h_j(F(x))}}/z(\theta) \). We can test whether \( X \sim F \) by, once again, testing whether \( \theta = 0 \), and the theory works just as before. If \( F \) is not fixed but involves some parameters \( \beta \), then we consider the smooth alternative densities \( g_{X}(x;\beta,\theta) = f(x;\beta) e^{\sum_{j=1}^{d}{\theta_j h_j(F(x;\beta))}}/z(\theta) \), and again we test the specification by testing \( \theta = 0 \). Since this always involves fixing \( d \) parameters, we always get a \( \chi^2_d \) asymptotic distribution under the null.
Rayner and Best's monograph is a clear, if now somewhat old-fashioned, exposition of Neyman's smooth test and its relatives and extensions. They actually begin with Pearson's \( X^2 \) or \( \chi^2 \) test, which can be seen as a smooth test for multinomial (rather than continuous) data, before going on to consider the general theory of likelihood ratio and score tests, and Neyman's smooth tests. Much of the book is taken up with various permutations of discretizing continuous variables and/or allowing estimation of the parameters I have written \( \beta \); the latter concern seems less important these days.
An important set of developments which does not get as much attention here as a more recent treatment would give is that of picking the order of the alternatives \( d \). Neyman suggested \( d = 4 \) but emphasized it was a guess; some later workers guessed \( d = 2 \) should be enough. Really, however, this is a problem of model selection or capacity control, and so all the usual tools, like cross-validation or information criteria, can be applied. This is one place where BIC has proved particularly useful, leading to "data-driven" smooth tests. These no longer have nice \( \chi^2 \) asymptotics, but it's pretty easy to get their sampling distributions from simulation.
Despite these limits, this is still a useful reference for people interested in specification checking.
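For the data-driven variant, a rough sketch of what "pick \( d \) by BIC, then simulate" might look like follows. The selection rule (maximize the order-\( d \) score statistic minus \( d \log n \), taking the smallest maximizer) is my assumption about the usual form of such tests, not something taken from the book, and `d_max`, `n_sim` and the function names are all hypothetical.

```python
import math
import random

def score_components(ys, d_max):
    """Squared normalized cosine-basis sums, one per order j = 1..d_max."""
    n = len(ys)
    comps = []
    for j in range(1, d_max + 1):
        s = sum(math.sqrt(2.0) * math.cos(j * math.pi * y) for y in ys)
        comps.append((s / math.sqrt(n)) ** 2)
    return comps

def data_driven_statistic(ys, d_max=10):
    """Choose d by a BIC-style penalty and return (chosen d, score statistic at that d)."""
    n = len(ys)
    comps = score_components(ys, d_max)
    best_d, best_stat, best_crit = 1, 0.0, -math.inf
    running = 0.0
    for d in range(1, d_max + 1):
        running += comps[d - 1]
        crit = running - d * math.log(n)   # assumed penalty: d * log(n)
        if crit > best_crit:               # strict inequality keeps the smallest maximizer
            best_d, best_stat, best_crit = d, running, crit
    return best_d, best_stat

def simulated_p_value(ys, d_max=10, n_sim=500, seed=0):
    """Monte Carlo p-value: rerun the whole selection procedure on uniform samples."""
    rng = random.Random(seed)
    n = len(ys)
    _, observed = data_driven_statistic(ys, d_max)
    exceed = 0
    for _ in range(n_sim):
        sim = [rng.random() for _ in range(n)]
        _, s = data_driven_statistic(sim, d_max)
        if s >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)
```

The point of re-running the selection inside every simulation is that the null distribution has to account for the choice of \( d \), which is exactly why the plain \( \chi^2 \) asymptotics no longer apply.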
Aliette De Bodard, Servant of the Underworld
Mind candy: historical fantasy/mystery set in Tenochtitlan (a few generations before what would be the Conquest), only with the mythology of the Aztecs being literally true and magic very much a part of actual life. It had some typical first-novel flaws (too much exposition, the plot drags in places), but overall decent.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Enigmas of Chance; Central Asia; Philosophy

Posted by crshalizi at April 30, 2012 23:59 | permanent link

April 24, 2012

Brought to You by the Letters D, A, and G (Advanced Data Analysis from an Elementary Point of View)

In which the arts of estimating causal effects from observational data are practiced on Sesame Street.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 24, 2012 10:31 | permanent link

Estimating Causal Effects from Observations (Advanced Data Analysis from an Elementary Point of View)

Estimating graphical models: substituting consistent estimators into the formulas for front and back door identification; average effects and regression; tricks to avoid estimating marginal distributions; matching and propensity scores as computational short-cuts in back-door adjustment. Instrumental variables estimation: the Wald estimator, two-stage least-squares. Summary recommendations for estimating causal effects.

Reading: Notes, chapter 24

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 24, 2012 10:30 | permanent link

April 22, 2012

Separated at Birth (Advanced Data Analysis from an Elementary Point of View)

In which we use graphical causal models to understand twin studies and variance components.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 22, 2012 12:00 | permanent link

April 21, 2012

Identifying Causal Effects from Observations (Advanced Data Analysis from an Elementary Point of View)

Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Summary recommendations for identifying causal effects.

Reading: Notes, chapter 23

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 21, 2012 12:00 | permanent link

April 20, 2012

Just How Quickly Do We Forget?

Attention conservation notice: 2500+ words on estimating how quickly time series forget their own history. Only of interest if you care about the intersection of stochastic processes and statistical learning theory. Full of jargon, equations, log-rolling and self-promotion, yet utterly abstract.

I promised to say something about the content of Daniel's thesis, so let me talk about two of his papers, which go into chapter 4; there is a short conference version and a long journal version.

Daniel J. McDonald, Cosma Rohilla Shalizi and Mark Schervish, "Estimating beta-mixing coefficients", AIStats 2011, arxiv:1103.0941
Abstract: The literature on statistical learning for time series assumes the asymptotic independence or "mixing" of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing rates from data. We give an estimator for the \( \beta \)-mixing rate based on a single stationary sample path and show it is \( L_1 \)-risk consistent.
----, "Estimating beta-mixing coefficients via histograms", arxiv:1109.5998
Abstract: The literature on statistical learning for time series often assumes asymptotic independence or "mixing" of data sources. Beta-mixing has long been important in establishing the central limit theorem and invariance principle for stochastic processes; recent work has identified it as crucial to extending results from empirical processes and statistical learning theory to dependent data, with quantitative risk bounds involving the actual beta coefficients. There is, however, presently no way to actually estimate those coefficients from data; while general functional forms are known for some common classes of processes (Markov processes, ARMA models, etc.), specific coefficients are generally beyond calculation. We present an \( L_1 \)-risk consistent estimator for the beta-mixing coefficients, based on a single stationary sample path. Since mixing coefficients involve infinite-order dependence, we use an order-d Markov approximation. We prove high-probability concentration results for the Markov approximation and show that as \( d \rightarrow \infty \), the Markov approximation converges to the true mixing coefficient. Our estimator is constructed using d dimensional histogram density estimates. Allowing asymptotics in the bandwidth as well as the dimension, we prove \( L_1 \) concentration for the histogram as an intermediate step.

Recall the world's simplest ergodic theorem. Suppose \( X_t \) is a sequence of random variables with common expectation \( m \) and variance \( v \), and stationary covariance \( \mathrm{Cov}[X_t, X_{t+h}] = c_h \). Then the time average \( \overline{X}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{X_i} \) also has expectation \( m \), and the question is whether it converges on that expectation. The world's simplest ergodic theorem asserts that if the correlation time \[ T = \frac{\sum_{h=1}^{\infty}{|c_h|}}{v} < \infty \] then \[ \mathrm{Var}\left[ \overline{X}_n \right] \leq \frac{v}{n}(1+2T) \]

Since, as I said, the expectation of \( \overline{X}_n \) is \( m \) and its variance is going to zero, we say that \( \overline{X}_n \rightarrow m \) "in mean square".

From this, we can get a crude but often effective deviation inequality, using Chebyshev's inequality: \[ \Pr{\left(|\overline{X}_n - m| > \epsilon\right)} \leq \frac{v}{\epsilon^2}\frac{1+2T}{n} \]

The meaning of the condition that the correlation time \( T \) be finite is that the correlations themselves have to trail off as we consider events which are widely separated in time — they don't ever have to be zero, but they do need to get smaller and smaller as the separation \( h \) grows. (One can actually weaken the requirement on the covariance function to just \( \lim_{n\rightarrow \infty}{\frac{1}{n}\sum_{h=1}^{n}{c_h}} = 0 \), but this would take us too far afield.) In fact, as these formulas show, the convergence looks just like what we'd see for independent data, only with \( \frac{n}{1+2T} \) samples instead of \( n \), so we call the former the effective sample size.
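As a small worked illustration of these formulas, here is a Python sketch that computes the correlation time, the variance bound, the effective sample size and the Chebyshev bound. The geometric covariance \( c_h = v \rho^h \) with \( \rho = 0.5 \), the truncation at 200 lags, and the function names are all assumptions made just for the example.

```python
def correlation_time(autocov, variance):
    """T = sum_{h>=1} |c_h| / v, using a truncated list autocov = [c_1, c_2, ...]."""
    return sum(abs(c) for c in autocov) / variance

def variance_bound(variance, n, T):
    """Upper bound (v/n)(1 + 2T) on the variance of the time average."""
    return variance / n * (1.0 + 2.0 * T)

def effective_sample_size(n, T):
    """n / (1 + 2T): how many independent samples the n dependent ones are worth."""
    return n / (1.0 + 2.0 * T)

def chebyshev_bound(variance, n, T, eps):
    """Bound on Pr(|time average - m| > eps); capped at 1 since it is a probability."""
    return min(1.0, variance_bound(variance, n, T) / eps ** 2)

# Assumed example: c_h = v * rho**h, so T = rho / (1 - rho) = 1 when rho = 0.5.
rho, v, n = 0.5, 1.0, 300
autocov = [v * rho ** h for h in range(1, 200)]
T = correlation_time(autocov, v)
n_eff = effective_sample_size(n, T)
dev = chebyshev_bound(v, n, T, eps=0.5)
```

With these numbers the 300 dependent observations are worth about 100 independent ones.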

All of this is about the convergence of averages of \( X_t \), and based on its covariance function \( c_h \). What if we care not about \( X \) but about \( f(X) \)? The same idea would apply, but unless \( f \) is linear, we can't easily get its covariance function from \( c_h \). The mathematicians' solution to this has been to invent stronger notions of decay-of-correlations, called "mixing". Very roughly speaking, we say that \( X \) is mixing when, if you pick any two (nice) functions \( f \) and \( g \), I can always show that \[ \lim_{h\rightarrow\infty}{\mathrm{Cov}\left[ f(X_t), g(X_{t+h}) \right]} = 0 \]

Note (or believe) that this is "convergence in distribution"; it happens if, and only if, the distribution of events up to time \( t \) is becoming independent of the distribution of events from time \( t+h \) onwards.

To get useful results, it is necessary to quantify mixing, which is usually done through somewhat stronger notions of dependence. (Unfortunately, none of these have meaningful names. The review by Bradley ought to be the standard reference.) For instance, the "total variation" or \( L_1 \) distance between probability measures \( P \) and \( Q \), with densities \( p \) and \( q \), is \[ d_{TV}(P,Q) = \frac{1}{2}\int{|p(u) - q(u)| du} \] This has several interpretations, but the easiest to grasp is that it says how much \( P \) and \( Q \) can differ in the probability they give to any one event: for any \( E \), \( d_{TV}(P,Q) \geq |P(E) - Q(E)| \). One use of this distance is to measure the dependence between random variables, by seeing how far their joint distribution is from the product of their marginal distributions. Abusing notation a little to write \( P(U,V) \) for the joint distribution of \( U \) and \( V \), we measure dependence as \[ \beta(U,V) \equiv d_{TV}(P(U,V), P(U) \otimes P(V)) = \frac{1}{2}\int{|p(u,v)-p(u)p(v)|du dv} \] This will be zero just when \( U \) and \( V \) are statistically independent, and one when, on average, conditioning on \( U \) confines \( V \) to a set which would otherwise have probability zero. (For instance if \( U \) has a continuous distribution and \( V \) is a function of \( U \), or one of two randomly chosen functions of \( U \).)
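For discrete variables the integrals become sums, and the definition can be checked by hand. A minimal Python sketch, with distributions stored as plain dictionaries (a representation I have picked purely for illustration):

```python
def total_variation(p, q):
    """Total variation distance between two pmfs given as {outcome: probability}."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

def beta_dependence(joint):
    """beta(U, V): distance from the joint pmf {(u, v): prob} to the product of its marginals."""
    pu, pv = {}, {}
    for (u, v), pr in joint.items():
        pu[u] = pu.get(u, 0.0) + pr
        pv[v] = pv.get(v, 0.0) + pr
    product = {(u, v): pu[u] * pv[v] for u in pu for v in pv}
    return total_variation(joint, product)

# Two fair, independent coins: no dependence at all.
independent = {(u, v): 0.25 for u in (0, 1) for v in (0, 1)}

# V is an exact copy of U, uniform on k = 4 values.
k = 4
copied = {(u, u): 1.0 / k for u in range(k)}

beta_indep = beta_dependence(independent)   # 0.0
beta_copy = beta_dependence(copied)         # 1 - 1/k = 0.75
```

The copied case gives \( 1 - 1/k \), which approaches one as the number of values grows, in line with the continuous case described above.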

We can relate this back to the earlier idea of correlations between functions by realizing that \[ \beta(U,V) = \frac{1}{2}\sup_{|r|\leq 1}{\left|\int{r(u,v) dP(U,V)} - \int{r(u,v)dP(U)dP(V)}\right|} ~, \] that is, \( \beta \) says (up to the factor of one half) how much the expected value of a bounded function \( r \) could change between the dependent and the independent distributions. (There is no assumption that the test function \( r \) factorizes, and in fact it's important to allow \( r(u,v) \neq f(u)g(v) \).)

We apply these ideas to time series by looking at the dependence between the past and the future: \[ \begin{eqnarray*} \beta(h) & \equiv & d_{TV}(P(X^t_{-\infty}, X_{t+h}^{\infty}), P(X^t_{-\infty}) \otimes P(X_{t+h}^{\infty})) \\ & = & \frac{1}{2}\int{|p(x^t_{-\infty},x_{t+h}^{\infty})-p(x^t_{-\infty})p(x^{\infty}_{t+h})|dx^t_{-\infty}dx^{\infty}_{t+h}} \end{eqnarray*} \] (By stationarity, the integral actually does not depend on \( t \).) When \( \beta(h) \rightarrow 0 \) as \( h \rightarrow \infty \), we have a "beta-mixing" process. (These are also called "absolutely regular".) Convergence in total variation implies convergence in distribution, but not vice versa, so beta-mixing is stronger than common-or-garden mixing.

Notions like beta-mixing were originally introduced purely for probabilistic convenience, to handle questions like "when does the central limit theorem hold for stochastic processes?" These are interesting for people who like stochastic processes, or indeed for those who want to do Markov chain Monte Carlo and want to know how long to let the chain run. For our purposes, though, what's important is that when people in statistical learning theory have given serious attention to dependent data, they have usually relied on a beta-mixing assumption.

The reason for this focus on beta-mixing is that it "plays nicely" with approximating dependent processes by independent ones. The usual form of such arguments is as follows. We want to prove a result about our dependent but mixing process \( X \). For instance, we realize that our favorite prediction model will tend to do worse out-of-sample than on the data used to fit it, and we might want to bound the probability that this over-fitting will exceed \( \epsilon \). If we know the beta-mixing coefficients \( \beta(h) \), we can pick a separation, call it \( a \), where \( \beta(a) \) is reasonably small. Now we divide \( X \) up into \( \mu = n/a \) blocks of length \( a \). If we take every other block, they're nearly independent of each other (because \( \beta(a) \) is small) but not quite (because \( \beta(a) \neq 0 \)). Introduce a (fictitious) random sequence \( Y \), where blocks of length \( a \) have the same distribution as the blocks in \( X \), but there's no dependence between blocks. Since \( Y \) is an IID process, it is easy for us to prove that, for instance, the probability of over-fitting \( Y \) by more than \( \epsilon \) is at most some small \( \delta(\epsilon,\mu/2) \). Since \( \beta \) tells us about how well dependent probabilities are approximated by independent ones, the probability of the bad event happening with the dependent data is at most \( \delta(\epsilon,\mu/2) + (\mu/2)\beta(a) \). We can make this as small as we like by letting \( \mu \) and \( a \) both grow as the time series gets longer. Basically, any result which holds for an IID process will also hold for a beta-mixing one, with a penalty in the probability that depends on \( \beta \). There are some details to fill in here (how to pick the separation \( a \)? should the blocks always be the same length as the "filler" between blocks?), but this is the basic frame.
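To see the trade-off in the choice of \( a \), here is a toy Python calculation of the bound \( \delta(\epsilon,\mu/2) + (\mu/2)\beta(a) \). Everything plugged in is a made-up stand-in: the geometric mixing rate `toy_beta`, the Hoeffding-style `toy_delta` for the IID bound, and the brute-force search over block lengths. None of it comes from the papers.

```python
import math

def blocking_bound(n, a, eps, beta, delta):
    """delta(eps, mu/2) + (mu/2) * beta(a), with mu = n // a blocks of length a."""
    mu = n // a
    half = mu / 2.0            # every other block is kept
    return delta(eps, half) + half * beta(a)

def best_block_length(n, eps, beta, delta):
    """Brute-force search for the block length giving the smallest bound."""
    return min(range(1, n // 2 + 1),
               key=lambda a: blocking_bound(n, a, eps, beta, delta))

def toy_beta(h):
    """Hypothetical geometrically decaying mixing coefficients."""
    return math.exp(-h / 2.0)

def toy_delta(eps, m):
    """Hypothetical IID bound of Hoeffding type for m independent blocks."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2)

n, eps = 10000, 0.2
a_star = best_block_length(n, eps, toy_beta, toy_delta)
bound_star = blocking_bound(n, a_star, eps, toy_beta, toy_delta)
bound_short = blocking_bound(n, 1, eps, toy_beta, toy_delta)       # too much dependence left
bound_long = blocking_bound(n, n // 2, eps, toy_beta, toy_delta)   # too few blocks
```

Very short blocks leave the mixing penalty large, very long blocks leave too few independent pieces, and the useful choices sit in between. Note that the bound is not capped at one here, so the extreme choices give vacuous values.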

What it leaves open, however, is how to estimate the mixing coefficients \( \beta(h) \). For Markov models, one could in principle calculate them from the transition probabilities. For more general processes, though, calculating beta from the known distribution is not easy. In fact, we are not aware of any previous work on estimating the \( \beta(h) \) coefficients from observational data. (References welcome!) Because of this, even in learning theory, people have just assumed that the mixing coefficients were known, or that it was known they went to zero at a certain rate. This was not enough for what we wanted to do, which was actually calculate bounds on error from data.

There were two tricks to actually coming up with an estimator. The first was to reduce the ambitions a little bit. If you look at the equation for \( \beta(h) \) above, you'll see that it involves integrating over the infinite-dimensional distribution. This is daunting, so instead of looking at the whole past and future, we'll introduce a horizon, \( d \) steps away, and cut things off there: \[ \begin{eqnarray*} \beta^{(d)}(h) & \equiv & d_{TV}(P(X^t_{t-d}, X_{t+h}^{t+h+d}), P(X^t_{t-d}) \otimes P(X_{t+h}^{t+h+d})) \\ & = & \frac{1}{2}\int{|p(x^t_{t-d},x_{t+h}^{t+h+d})-p(x^t_{t-d})p(x^{t+h+d}_{t+h})|dx^t_{t-d}dx^{t+h+d}_{t+h}} \end{eqnarray*} \] If \( X \) is a Markov process, then there's no difference between \( \beta^{(d)}(h) \) and \( \beta(h) \). If \( X \) is a Markov process of order \( p \), then \( \beta^{(d)}(h) = \beta(h) \) once \( d \geq p \). If \( X \) is not Markov at any order, it is still the case that \( \beta^{(d)}(h) \rightarrow \beta(h) \) as \( d \) grows. So we have an approximation to \( \beta \) which only involves finite-dimensional integrals, which we might have some hope of doing.

The other trick is to get rid of those integrals. Another way of writing the beta-dependence between the random variables \( U \) and \( V \) is \[ \beta(U,V) = \sup_{\mathcal{A},\mathcal{B}}{\frac{1}{2}\sum_{a\in\mathcal{A}}{\sum_{b\in\mathcal{B}}{\left| \Pr{(a \cap b)} - \Pr{(a)}\Pr{(b)} \right|}}} \] where \( \mathcal{A} \) runs over finite partitions of values of \( U \), and \( \mathcal{B} \) likewise runs over finite partitions of values of \( V \). I won't try to show that this formula is equivalent to the earlier definition, but I will contend that if you think about how that integral gets cashed out as a sum, you can sort of see how it would be. If we want \( \beta^{(d)}(h) \), we can take \( U = X^{t}_{t-d} \) and \( V = X^{t+h+d}_{t+h} \), and we could find the dependence by taking the supremum over partitions of those two variables.

Now, suppose that the joint density \( p(x^t_{t-d},x_{t+h}^{t+h+d}) \) was piecewise constant, with those pieces being rectangles parallel to the coordinate axes. Then sub-dividing those rectangles would not change the sum, and the \( \sup \) would actually be attained for that particular partition. Most densities are not of course piecewise constant, but we can approximate them by such piecewise-constant functions, and make the approximation arbitrarily close (in total variation). More, we can estimate those piecewise-constant approximating densities from a time series. Those estimates are, simply, histograms, which are about the oldest form of density estimation. We show that histogram density estimates converge in total variation on the true densities, when the bin-width is allowed to shrink as we get more data.

Because the total variation distance is in fact a metric, we can use the triangle inequality to get an upper bound on the true beta coefficient, in terms of the beta coefficients of the estimated histograms, and the expected error of the histogram estimates. All of the error terms shrink to zero as the time series gets longer, so we end up with consistent estimates of \( \beta^{(d)}(h) \). That's enough if we have a Markov process, but in general we don't. So we can let \( d \) grow as \( n \) does, and that (after a surprisingly long measure-theoretic argument) turns out to do the job: our histogram estimates of \( \beta^{(d)}(h) \), with suitably-growing \( d \), converge on the true \( \beta(h) \).
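A stripped-down Python sketch of the histogram idea may help fix ideas. It is only the plug-in part: it bins the series, tabulates past blocks against future blocks at separation \( h \), and takes half the \( L_1 \) distance between the joint histogram and the product of the marginal histograms. The block length `block` and bin count `bins` are fixed, hypothetical tuning parameters here rather than quantities growing with \( n \), and there is no correction for the histogram's own estimation error, so this should not be read as the estimator analyzed in the papers.

```python
import random
from collections import Counter

def histogram_beta(xs, h, block=1, bins=8):
    """Plug-in histogram estimate of the dependence between blocks h steps apart."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    if width == 0.0:
        width = 1.0                      # constant series: everything lands in one cell

    def cell(x):
        return min(int((x - lo) / width), bins - 1)

    codes = [cell(x) for x in xs]
    joint, past, future = Counter(), Counter(), Counter()
    m = 0
    # The past block ends at index t; the future block starts at index t + h.
    for t in range(block - 1, len(codes) - h - block + 1):
        u = tuple(codes[t - block + 1: t + 1])
        v = tuple(codes[t + h: t + h + block])
        joint[(u, v)] += 1
        past[u] += 1
        future[v] += 1
        m += 1
    if m == 0:
        raise ValueError("series too short for this h and block length")
    total = 0.0
    for u in past:
        for v in future:
            total += abs(joint[(u, v)] / m - (past[u] / m) * (future[v] / m))
    return 0.5 * total

# Simulated check: an IID series versus a strongly autocorrelated AR(1) series.
rng = random.Random(1)
iid = [rng.random() for _ in range(20000)]
ar = [0.0]
for _ in range(19999):
    ar.append(0.9 * ar[-1] + rng.gauss(0.0, 1.0))

beta_iid = histogram_beta(iid, h=1)
beta_ar_near = histogram_beta(ar, h=1)
beta_ar_far = histogram_beta(ar, h=20)
```

The IID estimate comes out small but not exactly zero, which is the finite-sample bias of the histogram. The autoregressive series shows strong dependence at lag 1 and much less at lag 20.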

To confirm that this works, the papers go through some simulation examples, where it's possible to cross-check our estimates. We can of course also do this for empirical time series. For instance, in his thesis Daniel took four standard macroeconomic time series for the US (GDP, consumption, investment, and hours worked, all de-trended in the usual way). This data goes back to 1948, and is measured four times a year, so there are 255 quarterly observations. Daniel estimated a \( \beta \) of 0.26 at one quarter's separation, \( \widehat{\beta}(2) = 0.15 \), \( \widehat{\beta}(3) = 0.02 \), and somewhere between 0 and 0.11 for \(\widehat{\beta}(4) \). (That last is a sign that we don't have enough data to go beyond \( h = 4 \).) Optimistically assuming no dependence beyond a year, one can calculate the effective number of independent data points, which is not 255 but 31. This has morals for macroeconomics which are worth dwelling on, but that will have to wait for another time. (Spoiler: \( \sqrt{\frac{1}{31}} \approx 0.18 \), and that's if you're lucky.)

It's inelegant to have to construct histograms when all we want is a single number, so it wouldn't surprise us if there were a slicker way of doing this. (For estimating mutual information, which is in many ways analogous, estimating the joint distribution as an intermediate step is neither necessary nor desirable.) But for now, we can do it, when we couldn't before.

Enigmas of Chance; Kith and Kin; Self-Centered

Posted by crshalizi at April 20, 2012 14:57 | permanent link

April 15, 2012

Graphical Causal Models (Advanced Data Analysis from an Elementary Point of View)

Probabilistic prediction is about passively selecting a sub-ensemble, leaving all the mechanisms in place, and seeing what turns up after applying that filter. Causal prediction is about actively producing a new ensemble, and seeing what would happen if something were to change ("counterfactuals"). Graphical causal models are a way of reasoning about causal prediction; their algebraic counterparts are structural equation models (generally nonlinear and non-Gaussian). The causal Markov property. Faithfulness. Performing causal prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules for linear models.

Reading: Notes, chapter 22

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 15, 2012 20:03 | permanent link

Exam: Is This Test Really Necessary? (Advanced Data Analysis from an Elementary Point of View)

In which the analysis of multivariate data is recursively applied.

Reading: Notes, assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 15, 2012 20:02 | permanent link

Graphical Models (Advanced Data Analysis from an Elementary Point of View)

Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth?

Reading: Notes, chapter 21

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 15, 2012 20:01 | permanent link

Mixture Models (Advanced Data Analysis from an Elementary Point of View)

From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry: planes again. Probabilistic clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.

Extended example: Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components.

Reading: Notes, chapter 20; mixture-examples.R

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 15, 2012 20:00 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems