May 21, 2012

If Peer Review Did Not Exist, We Would Have to Invent Something Very Like It to Serve Highly Similar Ends

Attention conservation notice: 1400 words on a friend's proposal to do away with peer review, written many weeks ago when there was actually some debate about this.

Larry is writing about peer review (again), this time to advocate "A World Without Referees". Every scientist, of course, has day-dreamed about this, in a first-let's-kill-all-the-lawyers way, but Larry is serious, so let's treat this seriously. I'm not going to summarize his argument; it's short and you can and should go read it yourself.

I think it helps, when thinking about this, to separate two functions peer-reviewed journals and conferences have traditionally served. One is spreading claims (dissemination), and the other is letting readers know about claims worthy of their attention (certification).

Arxiv, or something like it, can take over dissemination handily. Making copies of papers is now very cheap and very fast, so we no longer have to be choosy about which ones we disseminate. In physics, this use of Arxiv is just as well-established as Larry says. In fact, one reason Arxiv was able to establish itself so rapidly and thoroughly among physicists was that they already had a well-entrenched culture of circulating preprints long before journal publication. What Arxiv did was make this public and universally accessible.

But physicists still rely on journals for certification. People pay more attention to papers which come out in Physical Review Letters, or even just Physical Review E, than ones which are only on Arxiv. "Could it make it past peer review?" is used by many people as a filter to weed out the stuff which is too wrong or too unimportant to bother with. This doesn't work so well for those directly concerned with a particular research topic, but if something is only peripherally of interest, it makes a lot of sense.

Even within a specialized research community, consisting entirely of experts who can evaluate new contributions on their own, there is a rankling inefficiency to the world without referees. Larry talks about spending a minute or two looking at new stats. papers on Arxiv every day. But everyone filtering Arxiv for themselves is going to get harder and harder as more potentially-relevant stuff gets put on it. I'm interested in information theory, so I've long looked at cs.IT, and it's become notably more time-consuming as that community has embraced the Arxiv. Yet within any given epistemic community, lots of people are going to be applying very similar filters. So the world-without-referees has an increasing amount of work being done by individuals, but a lot of that work is redundant. Efficiency, the division of labor, points to having a few people put their time into filtering, and the rest of us relying on it, even when in principle we could do the filtering ourselves. To be fair, of course, we should probably take this job in turns...

So: if all papers get put on Arxiv, filtering becomes a big job, so efficiency pushes us towards having only some members of the research community do the filtering for the rest. We have re-invented something very much like peer review, purely so that our lives are not completely consumed by evaluating new papers, and we can actually get some work done.

Larry's proposal for a world without referees also doesn't seem to take into account the needs of researchers to rely on findings in fields in which they are not experts, and so can't act as their own filters. (Or they could if they put in a few years in something else first.) If I need some result from neuroscience, or for that matter from topology, I do not have the time to spend becoming a neuroscientist or topologist, and it is an immense benefit to have institutions I can trust to tell me "these claims about cortical columns, or locally compact Hausdorff spaces, are at least not crazy". This is also a kind of filtering, and there is the same push, based on the division of labor, to rely on only some neuroscientists or topologists to do the filtering for outsiders (or all of them only some of the time), and again we have re-created something very much like refereeing.

So: some form or forms of filtering is inevitable, and the forces pushing for a division of labor in filtering are very strong. I don't know of any reason to think that the current, historically-evolved peer review system is the best way of organizing this cognitive triage, but we're not going to avoid having some such system, nor should we want to. Different ways of organizing the work of filtering will have different costs and benefits, but we should be talking about those trade-offs, not hoping that we can just wish the problem away now that making copies is cheap [1]. It's not at all obvious, for instance, that attention-filtering for the internal benefit of members of a research community should be done in the same way as reliability-filtering for outsiders. But, to repeat, we are going to have filters and they are almost certainly going to involve a division of labor.

Lenin, supposedly, said that "small production engenders capitalism and the bourgeoisie daily, hourly, spontaneously and on a mass scale" (Nove, The Economics of Feasible Socialism Revisited, p. 46). Whether he was right about the bourgeoisie or not, the rate of production of the scientific literature, the similarity of interests and standards within a community, and the need to rely on other fields' findings are all going to engender refereeing-like institutions, "daily, hourly, spontaneously and on a mass scale". I don't think Larry would go to the same lengths to get rid of referees that Lenin went to get rid of the bourgeoisie, but in any case the truly progressive course is not to suppress the old system by force, but to provide a superior alternative.

Speaking personally, I am attracted to a scenario we might call "peer review among consenting adults". Let anyone put anything on Arxiv (modulo the usual crank-screen). But then let others create filtered versions, applying such standards of topic, rigor, applicability, writing quality, etc., as they please --- and be explicit about what those standards are. These can be layered as deep as their audience can support. Presumably the later filters would be intended for those further from active research in the area, and so would be less tolerant of false alarms, and more tolerant of missing possible discoveries, than the filters for those close to the work. But this could be an area for experiment, and for seeing what people actually find useful. This is, I take it, more or less what Paul Ginsparg proposes, and it has a lot to recommend it. Every contribution is available if anyone wants to read it, but no one is compelled to try to filter the whole flow of the scholarly literature unaided, and human intelligence can still be used to amplify interesting signals, or even to improve papers.

Attractive as I find this idea, I am not saying it is historically inevitable, or even the best possible way of ordering these matters. The main point is that peer review does some very important jobs for the community of inquirers (whether or not it evolved to do them), and that if we want to get rid of it, it would be a good idea to have something else ready to do those jobs.

[1]: For instance, many people have suggested that referees should have to take responsibility, in some way, for their reports, so that those who do sloppy or ignorant or merely-partisan work will be at least shamed. There is genuinely a lot to be said for this. But it does run into the conflicting demand that science should not be a respecter of persons --- if Grand Poo-Bah X writes a crappy paper, people should be able to call X on it, without fear of retribution or considering the (inevitable) internal politics of the discipline and the job-market. I do not know if there is a way to reconcile these, but that's one of the kinds of trade-offs we have to consider as we try to re-design this institution. ^

Learned Folly; Kith and Kin; The Collective Use and Evolution of Concepts

Posted by crshalizi at May 21, 2012 02:00 | permanent link

May 03, 2012

Ten Years of Monster Raving Egomania and Utter Batshit Insanity

Sometimes, all you can do is quote verbatim* from your inbox:

Date: Tue, 17 Apr 2012 09:31:57 -0400
From: Stephen Wolfram
To: Cosma Shalizi
Subject: 10-year followup on "A New Kind of Science"

Next month it'll be 10 years since I published "A New Kind of Science"
... and I'm planning to take stock of the decade of commentary, feedback and
follow-on work about the book that's appeared.

My archives show that you wrote an early review of the book:
http://www.cscs.umich.edu/~crshalizi/reviews/wolfram/

At the time reviews like yours appeared, most of the modern web apparatus
for response and public discussion had not yet developed.  But now it has,
and there seems to be considerable interest in the community in me using
that venue to give my responses and comments to early reviews.

I'm writing to ask if there's more you'd like to add before I embark on my
analysis in the next week or so.

I'd like to take this opportunity to thank you for the work you put into
writing a review of my book.  I know it was a challenge to review a book of
its size, especially quickly.  I plan to read all reviews with forbearance,
and hope that---especially leavened by the passage of a decade---useful
intellectual points can be derived from discussing them.

If you don't have anything to add to your early review, it'd be very helpful
to know that as soon as possible.

Thanks in advance for your help.

-- Stephen Wolfram

P.S. Nowadays you can find the whole book online at
http://www.wolframscience.com/nksonline/toc.html  If you'd like a new
physical copy, just let me know and I can have it shipped...


I wrote my review in 2002 (though I didn't put it out until 2005). The idea that complex patterns can arise from simple rules was already old then, and has only become more commonplace since. A lot of interesting, substantive, specific science has been done on that theme in the ensuing decade. To this effort, neither Wolfram nor his book has contributed anything of any note. The one respect in which I was overly pessimistic is that I have not, in fact, had to spend much time "de-programming students [who] read A New Kind of Science before knowing any better" --- but I get a rather different class of students these days than I did in 2002.

Otherwise, and for the record, I do indeed still stand behind the review.

Manual trackback: Hacker News; Wolfgang; Andrew Gelman

*: I removed our e-mail addresses, because no one deserves spam.

Self-Centered; Complexity; Psychoceramica

Posted by crshalizi at May 03, 2012 23:10 | permanent link

May 02, 2012

Installing pcalg

Attention conservation notice: Boring details about getting finicky statistical software to work; or, please read the friendly manual.

Some of my students are finding it difficult to install the R package pcalg; I share these instructions in case others are also in difficulty.

  1. For representing graphs, pcalg relies on two packages called RBGL and graph. These are not available on CRAN, but rather are on the other R software repository, BioConductor. To install them, follow the instructions at those links; to summarize, run this:
    source("http://bioconductor.org/biocLite.R")
    biocLite("RBGL")
    (Since RBGL depends on graph, this should automatically also install graph; if not, run biocLite("graph"), then biocLite("RBGL").)
  2. Now install pcalg from CRAN, along with the packages it depends on. You will get a warning about not having the Rgraphviz package. However, you will be able to load pcalg and run it. You should be able to step through the example labeled "Using Gaussian Data" at the end of help(pc), though it will not produce any plots.

    You can still extract the graph by hand from the fitted models returned by functions like pc --- if one of those objects is fit, then fit@graph@edgeL is a list of lists, where each node has its own list, naming the other nodes it has arrows to (not from). If you are doing this for the final in ADA, you don't actually need anything beyond this to do the assignment, as explained in question A1a.

  3. Rgraphviz is what pcalg relies on for drawing pictures of causal graphs. Its installation is somewhat tricky, so there is a README file, which you should read.
    The key point is that Rgraphviz itself relies on a non-R suite of programs called graphviz. You will want to install these. Go to graphviz.org, and download and install the software. (If you use a Mac, the standard download also includes Graphviz.app, which is a nice visual interface to the actual graph-drawing functions, and what I use for drawing the DAGs in the lecture notes.)
  4. You have to make sure that your operating system will let other software (like R) call on graphviz. The way to do this is to add the directory (or folder) where you installed graphviz to the list of places your computer recognizes as containing executable programs --- the system's "command path". The README for installing Rgraphviz explains what you have to add to the path. (If you are a Windows user and do not know how to alter the command path, read this.)
  5. If you have R open, close it. (If you do not, it will probably not know about the new software you've just gotten the system to recognize.) Re-open R, and install Rgraphviz. The basic installation command is just
    source("http://bioconductor.org/biocLite.R")
    biocLite("Rgraphviz")
    The README for Rgraphviz gives some checks which you should be able to run if everything is working; try them.
  6. You should now be able to generate pictures of DAGs with pc and the other functions in pcalg; try stepping through all the examples at the end of help(pc).
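
The by-hand extraction mentioned in step 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name edge_table and the toy list are made up here, and the toy list merely mimics the list-of-lists shape described above (one entry per node, each holding an "edges" element). Since that element might hold either node names or integer positions in the node list, the sketch accepts both.

```r
# Hypothetical helper (not part of pcalg): turn a list-of-lists edge
# structure, shaped like fit@graph@edgeL, into a two-column table of arrows.
edge_table <- function(edgeL) {
  node_names <- names(edgeL)
  rows <- lapply(node_names, function(from) {
    to <- edgeL[[from]]$edges
    # Assumption: targets are stored either as names or as integer positions
    if (is.numeric(to)) to <- node_names[to]
    if (length(to) == 0) return(NULL)   # this node has no outgoing arrows
    data.frame(from = from, to = to, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)                  # NULL entries are dropped by rbind
}

# Toy stand-in for fit@graph@edgeL: A -> B, A -> C, B -> C
toy <- list(A = list(edges = c(2L, 3L)),
            B = list(edges = 3L),
            C = list(edges = integer(0)))
print(edge_table(toy))
```

With a real fitted object from pc, the call would be edge_table(fit@graph@edgeL) instead of using the toy list.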

When I installed pcalg on my laptop two weeks ago, it was painless, because (1) I already had graphviz, and (2) I knew about BioConductor. (In fact, the R graphical interface on the Mac will switch between installing packages from CRAN and from BioConductor.) To check these instructions, I just now deleted all the packages from my computer and re-installed them, and everything worked; elapsed time, ten minutes, mostly downloading.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at May 02, 2012 21:30 | permanent link

May 01, 2012

Final Exam (Advanced Data Analysis from an Elementary Point of View)

In which we are devoted to two problems of political economy, viz., strikes, and macroeconomic forecasting.

Assignment; macro.csv

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at May 01, 2012 10:31 | permanent link

Time Series I (Advanced Data Analysis from an Elementary Point of View)

What time series are. Properties: autocorrelation or serial correlation; other notions of serial dependence; strong and weak stationarity. The correlation time and the world's simplest ergodic theorem; effective sample size. The meaning of ergodicity: a single increasingly long time series becomes representative of the whole process. Conditional probability estimates; Markov models; the meaning of the Markov property. Autoregressive models, especially additive autoregressions; conditional variance estimates. Bootstrapping time series. Trends and de-trending.

Reading: Notes, chapter 26; R for examples; gdp-pc.csv

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at May 01, 2012 10:30 | permanent link

April 30, 2012

Books to Read While the Algae Grow in Your Fur, April 2012

Attention conservation notice: I have no taste.

Susan Whitfield, Life along the Silk Road
Not-quite-historical fiction: life stories of sundry Silk Road characters — merchants, monks, soldiers, artists, ordinary widows — distributed from Samarkand to Chang-an, and from 700 to 900 AD. These are all more or less composites of actual people, glimpsed from the archaeological record, and especially through the manuscripts preserved at Dunhuang and saved/stolen by Aurel Stein. (In fact the whole book owes a great deal to Stein, with a lot of input from Beckwith's The Tibetan Empire in Central Asia.) The lack of references makes it hard to know how much is stitched together from sources and how much is Whitfield's invention, but at the very least it's well-told.
Nathan Long, Jane Carver of Waar
Mind candy. This is at once a parody of, and homage to, Barsoom. Unlike Burroughs, Long's book can be enjoyed after the Golden Age of Science Fiction (i.e., by those over the age of sixteen): his characters are all at least two-dimensional (Jane herself is an engaging narrator, though definitely at the Hill end of the Moby-Hill spectrum), his style is decent, and the plot is actually interesting. I think it would be enjoyable even if you hadn't dosed up on planetary romances as a kid.
James S. A. Corey (i.e., Daniel Abraham and Ty Franck), Leviathan Wakes
Mind candy. Space opera, confined to the solar system a few centuries hence. This has gotten a lot of favorable attention, but I found it merely OK; perhaps I'd have enjoyed it more if my expectations had been lower. It's split between two plot lines, with two point-of-view characters; I enjoyed (but wasn't blown away by) one of them, but found the other both over-predictable and irritating. It does some things well (a reasonably-sized solar system! minimal handwavium! a non-grim-meathook-future future! some decent characterization!), but it never really managed to grab me. It's definitely nowhere near as good as, say, McAuley's The Quiet War, to name a recent and thematically-similar book. The sequel will be out soon, and seems like it will be continuing along the better of the two narrative threads here, so I might pick it up, but I won't rush to do so.
Spoiler-laden griping: Bar bs gur gjb cybg yvarf vf n uneq-obvyrq vairfgvtngvba, pbzcyrgr jvgu na nypbubyvp zvqqyr-ntrq qrgrpgvir, pbeehcg vagevthrf, naq n zlfgrevbhf qnzr jub gur qrgrpgvir snyyf va ybir jvgu. V qba'g yvxr gur uneq-obvyrq traer, orpnhfr, juvyr V nz irel fragvzragny, vgf cnegvphyne pbzovangvba bs fragvzragnyvgl naq plavpvfz vf bss-chggvat. Fb onfvpnyyl V jnagrq gb fxvc nyy gur puncgref sebz Zvyyre'f cbvag bs ivrj, naq whfg sbyybj gubfr jvgu Ubyqra naq uvf perj. Yrff crefbanyyl (v.r., nf n engvbanyvmngvba), abve cerfhccbfrf fhpu n irel cnegvphyne, uvfgbevpnyyl-yvzvgrq phygheny frggvat gung frrvat vg fvzcyl qhzcrq vagb jung fubhyq or n enqvpnyyl arj xvaq bs fbpvrgl jnf wneevat. (Rirelguvat ba Prerf jbexf yvxr Puvpntb pvepn 1940 orpnhfr ubj ryfr?).
Ba n qvssrerag cynar nygbtrgure, Cebgbtra'f ernfbaf sbe jnagvat gb gel bhg gur nyvra ivehf/znpuvar ba gur jubyr cbchyngvba bs Rebf ner jrnx. Vs gur cbvag bs gur znpuvar vf gb gnxr bire rkvfgvat ovbznff naq erfuncr vg nppbeqvat gb fbzr cebtenz, vg jbhyq frrz vasvavgryl rnfvre gb tvir vg hzcgrra gbaf bs lrnfg gb cynl jvgu, guna gb fcraq lrnef bepurfgengvat gur gnxr-bire bs n pbybal jvgu bire n zvyyvba crbcyr, gb fnl abguvat bs gur erqhprq cbffvovyvgl sbe oybj-onpx, frphevgl oernpurf, rgp. Ab qbhog gurl'q jnag gb gel vg ba crbcyr riraghnyyl, ohg fgnegvat gurer, jvgu ab pbageby bire rssrpgf, vf whfg onq rkcrevzragny qrfvta. Cyhf "tvir gur napvrag fhcre-nqinaprq nyvra jne znpuvar pbageby bire na nfgrebvq" qbrf abg fbhaq yvxr n cyna juvpu jbhyq qrirybc gb n fbpvbcngu'f nqinagntr. (Gurl jbhyqa'g pner nobhg gur qnzntr gb bguref, ohg gurzfryirf?)
In conclusion, bring me back my cane and then get off my lawn, you're trampling the lilies.
Matthew Johnson, Fall from Earth [buying: publisher, audio]
Mind candy. Scheme-laden first-contact space opera with a social setting I can only call "The Ming Dynasty IN SPAAAAACE". Good enough that I will keep a look out for more from Johnson.
It's a small thing, but Johnson shows no appreciation of the energy required to move food from planet to planet, which makes his "equitable marketing system" a complete non-starter. (But he shares this flaw with Cherryh's deservedly-admired Downbelow Station.) If, however, the magistracy wants to make sure that no world can become self-sufficient, the way to do it would be to restrict their manufacturing, since any colony would be dependent for survival on a complex industrial infrastructure.
Bernard Williams, Truth and Truthfulness: An Essay in Genealogy
Shorter Williams: "Say what you mean. Bear witness. Iterate." (The late John M. Ford, in a different context.)
Slightly longer: You can get a decent sense of what the book is about from the publishers, so I'll comment without much exposition.
When Williams talks about a "genealogy" of some idea or practice, he means an account of why, if it did not exist, we would have to invent it. Specifically, he spins a state-of-nature story about how if, in the state of nature, human beings did not have an idea of truth, but nonetheless were social and rational animals, and so dependent on a division of epistemic labor, they would have to form one, and two "virtues of truthfulness", namely "sincerity" (Ford's "say what you mean") and "accuracy" (Ford's "bear witness") to make it effective. This is not intended as history or pre-history (Williams: "the state of nature is not the Pleistocene"), but it is a bit mysterious to me how then it is supposed to explain our notions of truth, truthfulness, sincerity, accuracy, etc., much less explain them "non-reductively". Perhaps — this is suggested by his section on "Shameful Origins" — it is just supposed to make us feel better about having them, by convincing us that we could have acquired such ideas in a way which doesn't discredit them. (We are not suckers.)
It may sound odd to describe "accuracy" as a virtue, but being accurate --- bearing good witness --- means things like checking tendencies to leap to conclusions, choosing appropriate methods of inquiry, taking pains to secure all the relevant facts (Williams is especially good on the notion of "facts"), etc. Williams is indeed eloquent on how the virtues of accuracy are one of the things which have made the pursuit of science a source of human values, especially in circumstances where honesty otherwise was hard.
As this last suggests, culture lets us articulate the raw virtues of sincerity and accuracy into incredibly elaborate and interlocking complexes of attitudes and practices (Ford's "iterate"). From the inside, these have, or at least seem to have, intrinsic as well as instrumental value, and indeed they would not work at all if their value was just instrumental. I confess that I do not fully follow Williams's attempt to explain when or why or how the virtues of truth become "intrinsic values". It seems to be something like: people find these values compelling, in a way which they would not if they saw them just as handy tools for achieving selfish ends; this in turn makes these values successful commitment devices [1]. Williams seems to me to equivocate as to whether these virtues really do have such intrinsic value, but on balance I am just as happy that he strayed no deeper into the swamp of meta-ethics, and wisely turned back to the sounder terrain of looking at certain episodes in the articulation of these virtues. The two main case-studies he gives are contrasts of Thucydides and Herodotus on history, and of Rousseau and Diderot on authenticity and the self. Both of these really have a wider, philosophical import, and as such they would both have been stronger for a more comparative, cross-cultural perspective --- not in the service of the small virtue of courtesy (Williams has mercifully few "what you mean 'we', white man?" moments), but rather in the service of the great virtue of accuracy [2].
But I see that I am descending into my usual quibbling. This is a profoundly thoughtful and profoundly learned book, which has interesting things to say about some of the deepest and most humanly-important problems in philosophy, and says them elegantly. Go read.
[1] I cannot help but be reminded of William James:
Now, why do the various animals do what seem to us such strange things, in the presence of such outlandish stimuli? Why does the hen, for example, submit herself to the tedium of incubating such a fearfully uninteresting set of objects as a nestful of eggs, unless she have some sort of a prophetic inkling of the result? The only answer is ad hominem. We can only interpret the instincts of brutes by what we know of instincts in ourselves. Why do men always lie down, when they can, on soft beds rather than on hard floors? Why do they sit round the stove on a cold day? Why, in a room, do they place themselves, ninety-nine times out of a hundred, with their faces towards its middle rather than to the wall? Why do they prefer saddle of mutton and champagne to hard-tack and ditch-water? Why does the maiden interest the youth so that everything about her seems more important and significant than anything else in the world? Nothing more can be said than that these are human ways, and that every creature likes its own ways, and takes to the following them as a matter of course. Science may come and consider these ways, and find that most of them are useful. But it is not for the sake of their utility that they are followed, but because at the moment of following them we feel that that is the only appropriate and natural thing to do. Not one man in a billion, when taking his dinner, ever thinks of utility. He eats because the food tastes good and makes him want more. If you ask him why he should want to eat more of what tastes like that, instead of revering you as a philosopher he will probably laugh at you for a fool. The connection between the savory sensation and the act it awakens is for him absolute and selbstverständlich, an "a priori synthesis" of the most perfect sort, needing no proof but its own evidence.
It takes, in short, what Berkeley calls a mind debauched by learning to carry the process of making the natural seem strange, so far as to ask for the why of any instinctive human act. To the metaphysician alone can such questions occur as: Why do we smile, when pleased, and not scowl? Why are we unable to talk to a crowd as we talk to a single friend? Why does a particular maiden turn our wits so upside-down? The common man can only say, "Of course we smile, of course our heart palpitates at the sight of the crowd, of course we love the maiden, that beautiful soul clad in that perfect form, so palpably and flagrantly made from all eternity to be loved!"
And so, probably, does each animal feel about the particular things it tends to do in presence of particular objects. They, too, are a priori syntheses. To the lion it is the lioness which is made to be loved; to the bear, the she-bear. To the broody hen the notion would probably seem monstrous that there should be a creature in the world to whom a nestful of eggs was not the utterly fascinating and precious and never-to-be-too-much-sat-upon object which it is to her.
Thus we may be sure that, however mysterious some animals' instincts may appear to us, our instincts will appear no less mysterious to them. And we may conclude that, to the animal which obeys it, every impulse and every step of every instinct shines with its own sufficient light, and seems at the moment the only eternally right and proper thing to do. It is done for its own sake exclusively. What voluptuous thrill may not shake a fly, when she at last discovers the one particular leaf, or carrion, or bit of dung, that out of all the world can stimulate her ovipositor to its discharge? Does not the discharge then seem to her the only fitting thing? And need she care or know anything about the future maggot and its food?
More soberly, or at least with fewer hens and maggots, this is highly reminiscent of Robert Frank's Passion within Reason, which I do not believe Williams mentions. ^
[2] Williams claims, quite plausibly, that Thucydides had different ideas about historical explanation and historical evidence than did Herodotus --- ones which are both stricter about what counts as acceptable history, and which are supported by compelling rationales even within the older framework. He also claims, more sketchily, that Herodotus was immersed in a culture which was still partly oral and partly literate, while Thucydides was not. If all this was right, should not the same contrast show up in the historical traditions of China, the Islamic world, etc.? Why does such a tradition not seem to be indigenous to India? (Cf., on all this, Brown's History, Hierarchy, and Human Nature.) Western Europe, after the fall of the western Roman Empire, never lost literacy, but it certainly didn't produce histories like Thucydides's for many centuries: why, on Williams's account, not? (Actually, outside of Italy, did western Europe ever produce such histories before the fall of the empire?) If there are important distinctions between these cases, such that Williams's account applies only in the special circumstances of the Aegean around 500--300 BC, what are those circumstances? --- Let me add that it was Williams who made all these considerations relevant, not me. ^
J. C. W. Rayner and D. J. Best, Smooth Tests of Goodness of Fit
Suppose a random variable \( Y \) is confined to the unit interval \( [0,1] \), and we want to test whether it is uniformly distributed. One way to do this would be to construct alternative distributions which are in some sense smooth departures from uniformity, with densities \( g(y;\theta) = e^{\sum_{j=1}^{d}{\theta_j h_j(y)}}/z(\theta) \), where it is convenient to choose the \( h_j \) functions to be an orthonormal basis --- the cosine basis, say, or the Legendre polynomials. (That is, they are orthonormal in \( L_2 \), the space of square-integrable functions on the unit interval.) Uniformity is then the special case \( \theta = 0 \), and we can test it against the alternative that \( \theta \neq 0 \) by the usual devices of a likelihood-ratio test, a score test, etc., which will all, under the null hypothesis, have an asymptotic \( \chi^2_d \) distribution. This is Neyman's original smooth test, which seems to have originated from the problem of how to combine p-values from independent experiments, which should all be uniformly distributed under the null hypothesis. One nice feature of this test is that if we reject the null, we immediately have an alternative, namely our maximum likelihood estimate of \( \theta \), for what the actual distribution is --- it tells us not just that the null model is wrong, but how, and what a better one would be like.
The real power of this comes from the following observation. If \( X \) is distributed according to some continuous CDF \( F \), with density \( f \), then \( Y=F(X) \) is uniformly distributed on \( [0,1] \). The smooth alternatives for \( Y \) translate into smooth alternatives for \( X \), with densities \( g_X(x;\theta) = f(x) e^{\sum_{j=1}^{d}{\theta_j h_j(F(x))}}/z(\theta) \). We can test whether \( X \sim F \) by, once again, testing whether \( \theta = 0 \), and the theory works just as before. If \( F \) is not fixed but involves some parameters \( \beta \), then we consider the smooth alternative densities \( g_{X}(x;\beta,\theta) = f(x;\beta) e^{\sum_{j=1}^{d}{\theta_j h_j(F(x;\beta))}}/z(\theta) \), and again we test the specification by testing \( \theta = 0 \). Since this always involves fixing \( d \) parameters, we always get a \( \chi^2_d \) asymptotic distribution under the null.
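To make this concrete, here is a minimal sketch of the score-test version for a fully specified null, which is my illustration rather than anything taken from the book. With an orthonormal basis, each \( h_j(Y) \) has mean 0 and variance 1 under uniformity, so the score statistic reduces to \( \sum_{j=1}^{d} \left( n^{-1/2} \sum_{i=1}^{n} h_j(y_i) \right)^2 \), to be compared to \( \chi^2_d \). The function name smooth_test_uniform, the choice of the cosine basis \( h_j(y) = \sqrt{2}\cos(j \pi y) \), and the default \( d = 4 \) are all assumptions made for the example.

```r
# Sketch of a Neyman-style smooth test of uniformity on [0,1] (score form),
# using the cosine basis h_j(y) = sqrt(2) * cos(j * pi * y), j = 1, ..., d.
smooth_test_uniform <- function(y, d = 4) {
  stopifnot(all(y >= 0 & y <= 1), d >= 1)
  n <- length(y)
  # n x d matrix: column j holds h_j evaluated at every data point
  H <- sqrt(2) * cos(pi * outer(y, seq_len(d)))
  # Normalized score components; under the null they are asymptotically
  # independent standard Gaussians
  u <- colSums(H) / sqrt(n)
  stat <- sum(u^2)
  list(statistic = stat, df = d, components = u,
       p.value = pchisq(stat, df = d, lower.tail = FALSE))
}

set.seed(1)
smooth_test_uniform(runif(500))$p.value        # null is true
smooth_test_uniform(rbeta(500, 2, 2))$p.value  # a smooth departure from uniformity
# Testing X ~ F for a fixed F: transform to Y = F(X) first
smooth_test_uniform(pnorm(rnorm(500)))$p.value
```

This covers only the case where \( F \) is completely fixed; if \( \beta \) has to be estimated, the components need adjusting, which the sketch does not attempt.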
Rayner and Best's monograph is a clear, if now somewhat old-fashioned, exposition of Neyman's smooth test and its relatives and extensions. They actually begin with Pearson's \( X^2 \) or \( \chi^2 \) test, which can be seen as a smooth test for multinomial (rather than continuous) data, before going on to consider the general theory of likelihood ratio and score tests, and Neyman's smooth tests. Much of the book is taken up with various permutations of discretizing continuous variables and/or allowing estimation of the parameters I have written \( \beta \); the latter concern seems less important these days.
An important set of developments which does not get as much attention here as a more recent treatment would give is that of picking the order of the alternatives \( d \). Neyman suggested \( d = 4 \) but emphasized it was a guess; some later workers guessed \( d = 2 \) should be enough. Really, however, this is a problem of model selection or capacity control, and so all the usual tools, like cross-validation or information criteria, can be applied. This is one place where BIC has proved particularly useful, leading to "data-driven" smooth tests. These no longer have nice \( \chi^2 \) asymptotics, but it's pretty easy to get their sampling distributions from simulation.
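
A rough sketch of what such a data-driven test might look like, again with the cosine basis: pick the order by maximizing the score statistic minus a BIC-style penalty of \( d \log{n} \), then get the null distribution of the selected statistic by simulating uniform samples. The exact selection rule, the cap `d_max`, and the number of replicates are my assumptions for illustration, not a rule taken from the book.

```python
import math
import random


def score_components(ys, d_max):
    """Squared normalized sums of the cosine basis functions, one per order."""
    n = len(ys)
    comps = []
    for j in range(1, d_max + 1):
        s = sum(math.sqrt(2.0) * math.cos(j * math.pi * y) for y in ys)
        comps.append(s * s / n)
    return comps


def data_driven_statistic(ys, d_max=8):
    """Pick d by maximizing (score statistic - d log n); return (d, statistic)."""
    n = len(ys)
    best_d, best_stat, best_crit = 0, 0.0, -math.inf
    running = 0.0
    for d, comp in enumerate(score_components(ys, d_max), start=1):
        running += comp
        crit = running - d * math.log(n)
        if crit > best_crit:  # strict inequality keeps the smallest maximizer
            best_d, best_stat, best_crit = d, running, crit
    return best_d, best_stat


def monte_carlo_p_value(ys, d_max=8, reps=300, seed=0):
    """Simulate the selected statistic under uniformity instead of using a table."""
    rng = random.Random(seed)
    n = len(ys)
    _, observed = data_driven_statistic(ys, d_max)
    exceed = 0
    for _ in range(reps):
        _, sim = data_driven_statistic([rng.random() for _ in range(n)], d_max)
        exceed += sim >= observed
    return (1 + exceed) / (1 + reps)


rng = random.Random(42)
# Draws with density 2(1 - y) on [0, 1], a smooth departure from uniformity.
sample = [1.0 - math.sqrt(1.0 - rng.random()) for _ in range(200)]
order, stat = data_driven_statistic(sample)
p_value = monte_carlo_p_value(sample)
print("selected d:", order, "statistic:", round(stat, 2), "p-value:", round(p_value, 4))
```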
Despite these limits, this is still a useful reference for people interested in specification checking.
Aliette De Bodard, Servant of the Underworld
Mind candy: historical fantasy/mystery set in Tenochtitlan (a few generations before what would be the Conquest), only with the mythology of the Aztecs being literally true and magic very much a part of actual life. It had some typical first-novel flaws (too much exposition, the plot drags in places), but overall decent.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Enigmas of Chance; Central Asia; Philosophy

Posted by crshalizi at April 30, 2012 23:59 | permanent link

April 24, 2012

Brought to You by the Letters D, A, and G (Advanced Data Analysis from an Elementary Point of View)

In which the arts of estimating causal effects from observational data are practiced on Sesame Street.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 24, 2012 10:31 | permanent link

Estimating Causal Effects from Observations (Advanced Data Analysis from an Elementary Point of View)

Estimating graphical models: substituting consistent estimators into the formulas for front and back door identification; average effects and regression; tricks to avoid estimating marginal distributions; matching and propensity scores as computational short-cuts in back-door adjustment. Instrumental variables estimation: the Wald estimator, two-stage least-squares. Summary recommendations for estimating causal effects.

Reading: Notes, chapter 24

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 24, 2012 10:30 | permanent link

April 22, 2012

Separated at Birth (Advanced Data Analysis from an Elementary Point of View)

In which we use graphical causal models to understand twin studies and variance components.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 22, 2012 12:00 | permanent link

April 21, 2012

Identifying Causal Effects from Observations (Advanced Data Analysis from an Elementary Point of View)

Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Summary recommendations for identifying causal effects.

Reading: Notes, chapter 23

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 21, 2012 12:00 | permanent link

April 20, 2012

Just How Quickly Do We Forget?

Attention conservation notice: 2500+ words on estimating how quickly time series forget their own history. Only of interest if you care about the intersection of stochastic processes and statistical learning theory. Full of jargon, equations, log-rolling and self-promotion, yet utterly abstract.

I promised to say something about the content of Daniel's thesis, so let me talk about two of his papers, which go into chapter 4; there is a short conference version and a long journal version.

Daniel J. McDonald, Cosma Rohilla Shalizi and Mark Schervish, "Estimating beta-mixing coefficients", AIStats 2011, arxiv:1103.0941
Abstract: The literature on statistical learning for time series assumes the asymptotic independence or "mixing" of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing rates from data. We give an estimator for the \( \beta \)-mixing rate based on a single stationary sample path and show it is \( L_1 \)-risk consistent.
----, "Estimating beta-mixing coefficients via histograms", arxiv:1109.5998
Abstract: The literature on statistical learning for time series often assumes asymptotic independence or "mixing" of data sources. Beta-mixing has long been important in establishing the central limit theorem and invariance principle for stochastic processes; recent work has identified it as crucial to extending results from empirical processes and statistical learning theory to dependent data, with quantitative risk bounds involving the actual beta coefficients. There is, however, presently no way to actually estimate those coefficients from data; while general functional forms are known for some common classes of processes (Markov processes, ARMA models, etc.), specific coefficients are generally beyond calculation. We present an \( L_1 \)-risk consistent estimator for the beta-mixing coefficients, based on a single stationary sample path. Since mixing coefficients involve infinite-order dependence, we use an order-d Markov approximation. We prove high-probability concentration results for the Markov approximation and show that as \( d \rightarrow \infty \), the Markov approximation converges to the true mixing coefficient. Our estimator is constructed using d dimensional histogram density estimates. Allowing asymptotics in the bandwidth as well as the dimension, we prove \( L_1 \) concentration for the histogram as an intermediate step.

Recall the world's simplest ergodic theorem: suppose \( X_t \) is a sequence of random variables with common expectation \( m \) and variance \( v \), and stationary covariance \( \mathrm{Cov}[X_t, X_{t+h}] = c_h \). Then the time average \( \overline{X}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{X_i} \) also has expectation \( m \), and the question is whether it converges on that expectation. The world's simplest ergodic theorem asserts that if the correlation time \[ T = \frac{\sum_{h=1}^{\infty}{|c_h|}}{v} < \infty \] then \[ \mathrm{Var}\left[ \overline{X}_n \right] \leq \frac{v}{n}(1+2T) \]

Since, as I said, the expectation of \( \overline{X}_n \) is \( m \) and its variance is going to zero, we say that \( \overline{X}_n \rightarrow m \) "in mean square".

From this, we can get a crude but often effective deviation inequality, using Chebyshev's inequality: \[ \Pr{\left(|\overline{X}_n - m| > \epsilon\right)} \leq \frac{v}{\epsilon^2}\frac{1+2T}{n} \]

The meaning of the condition that the correlation time \( T \) be finite is that the correlations themselves have to trail off as we consider events which are widely separated in time — they don't ever have to be zero, but they do need to get smaller and smaller as the separation \( h \) grows. (One can actually weaken the requirement on the covariance function to just \( \lim_{n\rightarrow \infty}{\frac{1}{n}\sum_{h=1}^{n}{c_h}} = 0 \), but this would take us too far afield.) In fact, as these formulas show, the convergence looks just like what we'd see for independent data, only with \( \frac{n}{1+2T} \) samples instead of \( n \), so we call the former the effective sample size.
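
For a feel for the numbers, here is a small sketch of the correlation time, the effective sample size, and the Chebyshev bound. The geometric autocovariance \( c_h = v\phi^h \) with \( \phi = 0.5 \) is an assumed example (it is what an AR(1) process would give), and the function names are mine.

```python
def correlation_time(autocovariances, variance):
    """T = sum_{h >= 1} |c_h| / v, using however many lags are supplied."""
    return sum(abs(c) for c in autocovariances) / variance


def effective_sample_size(n, corr_time):
    """n / (1 + 2T): the number of independent samples n dependent ones are worth."""
    return n / (1.0 + 2.0 * corr_time)


def chebyshev_bound(n, variance, corr_time, eps):
    """Upper bound on Pr(|mean - m| > eps); capped at 1 since it is a probability."""
    return min(1.0, variance / eps ** 2 * (1.0 + 2.0 * corr_time) / n)


# Assumed example: c_h = v * phi**h, truncated at 199 lags.
phi, v, n = 0.5, 1.0, 300
c = [v * phi ** h for h in range(1, 200)]
T = correlation_time(c, v)
n_eff = effective_sample_size(n, T)
bound = chebyshev_bound(n, v, T, eps=0.5)
print("T =", T, "effective n =", n_eff, "bound =", bound)
```

Here \( T = \phi/(1-\phi) = 1 \), so 300 observations are worth about 100 independent ones, and the bound on a deviation of more than 0.5 is 0.04 rather than the 0.0133 that independent data would give.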

All of this is about the convergence of averages of \( X_t \), and based on its covariance function \( c_h \). What if we care not about \( X \) but about \( f(X) \)? The same idea would apply, but unless \( f \) is linear, we can't easily get its covariance function from \( c_h \). The mathematicians' solution to this has been to invent stronger notions of decay-of-correlations, called "mixing". Very roughly speaking, we say that \( X \) is mixing when, if you pick any two (nice) functions \( f \) and \( g \), I can always show that \[ \lim_{h\rightarrow\infty}{\mathrm{Cov}\left[ f(X_t), g(X_{t+h}) \right]} = 0 \]

Note (or believe) that this is "convergence in distribution"; it happens if, and only if, the distribution of events up to time \( t \) is becoming independent of the distribution of events from time \( t+h \) onwards.

To get useful results, it is necessary to quantify mixing, which is usually done through somewhat stronger notions of dependence. (Unfortunately, none of these have meaningful names. The review by Bradley ought to be the standard reference.) For instance, the "total variation" or \( L_1 \) distance between probability measures \( P \) and \( Q \), with densities \( p \) and \( q \), is \[ d_{TV}(P,Q) = \frac{1}{2}\int{|p(u) - q(u)| du} \] This has several interpretations, but the easiest to grasp is that it says how much \( P \) and \( Q \) can differ in the probability they give to any one event: for any \( E \), \( d_{TV}(P,Q) \geq |P(E) - Q(E)| \). One use of this distance is to measure the dependence between random variables, by seeing how far their joint distribution is from the product of their marginal distributions. Abusing notation a little to write \( P(U,V) \) for the joint distribution of \( U \) and \( V \), we measure dependence as \[ \beta(U,V) \equiv d_{TV}(P(U,V), P(U) \otimes P(V)) = \frac{1}{2}\int{|p(u,v)-p(u)p(v)|du dv} \] This will be zero just when \( U \) and \( V \) are statistically independent, and one when, on average, conditioning on \( U \) confines \( V \) to a set which would otherwise have probability zero. (For instance if \( U \) has a continuous distribution and \( V \) is a function of \( U \), or one of two randomly chosen functions of \( U \).)
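
For discrete variables the integral becomes a sum, and \( \beta(U,V) \) can be computed directly from a joint probability table. A minimal sketch, where the dictionary representation of the joint distribution is just my choice for illustration:

```python
def total_variation(p, q):
    """Half the L1 distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def beta_dependence(joint):
    """beta(U, V) for a joint distribution given as {(u, v): probability}."""
    p_u, p_v = {}, {}
    for (u, v), pr in joint.items():
        p_u[u] = p_u.get(u, 0.0) + pr
        p_v[v] = p_v.get(v, 0.0) + pr
    product = {(u, v): p_u[u] * p_v[v] for u in p_u for v in p_v}
    return total_variation(joint, product)


# Two independent fair coins: beta is zero.
independent = {(u, v): 0.25 for u in (0, 1) for v in (0, 1)}

# V = U, with U uniform on k values: beta works out to 1 - 1/k.
def copy_joint(k):
    return {(i, i): 1.0 / k for i in range(k)}

print(beta_dependence(independent))      # 0.0
print(beta_dependence(copy_joint(2)))    # 0.5
print(beta_dependence(copy_joint(100)))  # close to 0.99
```

The last case shows how the limit of one arises: as \( U \) takes more and more values, knowing \( U \) confines \( V \) to a set whose probability would otherwise be negligible.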

We can relate this back to the earlier idea of correlations between functions by realizing that \[ \beta(U,V) = \frac{1}{2}\sup_{|r|\leq 1}{\left|\int{r(u,v) dP(U,V)} - \int{r(u,v)dP(U)dP(V)}\right|} ~, \] that is, \( \beta \) measures how much the expected value of a bounded function \( r \) could change between the dependent and the independent distributions. (There is no assumption that the test function \( r \) factorizes, and in fact it's important to allow \( r(u,v) \neq f(u)g(v) \).)

We apply these ideas to time series by looking at the dependence between the past and the future: \[ \begin{eqnarray*} \beta(h) & \equiv & d_{TV}(P(X^t_{-\infty}, X_{t+h}^{\infty}), P(X^t_{-\infty}) \otimes P(X_{t+h}^{\infty})) \\ & = & \frac{1}{2}\int{|p(x^t_{-\infty},x_{t+h}^{\infty})-p(x^t_{-\infty})p(x^{\infty}_{t+h})|dx^t_{-\infty}dx^{\infty}_{t+h}} \end{eqnarray*} \] (By stationarity, the integral actually does not depend on \( t \).) When \( \beta(h) \rightarrow 0 \) as \( h \rightarrow \infty \), we have a "beta-mixing" process. (These are also called "absolutely regular".) Convergence in total variation implies convergence in distribution, but not vice versa, so beta-mixing is stronger than common-or-garden mixing.

Notions like beta-mixing were originally introduced purely for probabilistic convenience, to handle questions like "when does the central limit theorem hold for stochastic processes?" These are interesting for people who like stochastic processes, or indeed for those who want to do Markov chain Monte Carlo and want to know how long to let the chain run. For our purposes, though, what's important is that when people in statistical learning theory have given serious attention to dependent data, they have usually relied on a beta-mixing assumption.

The reason for this focus on beta-mixing is that it "plays nicely" with approximating dependent processes by independent ones. The usual form of such arguments is as follows. We want to prove a result about our dependent but mixing process \( X \). For instance, we realize that our favorite prediction model will tend to do worse out-of-sample than on the data used to fit it, and we might want to bound the probability that this over-fitting will exceed \( \epsilon \). If we know the beta-mixing coefficients \( \beta(h) \), we can pick a separation, call it \( a \), where \( \beta(a) \) is reasonably small. Now we divide \( X \) up into \( \mu = n/a \) blocks of length \( a \). If we take every other block, they're nearly independent of each other (because \( \beta(a) \) is small) but not quite (because \( \beta(a) \neq 0 \)). Introduce a (fictitious) random sequence \( Y \), where blocks of length \( a \) have the same distribution as the blocks in \( X \), but there's no dependence between blocks. Since \( Y \) is an IID process, it is easy for us to prove that, for instance, the probability of over-fitting \( Y \) by more than \( \epsilon \) is at most some small \( \delta(\epsilon,\mu/2) \). Since \( \beta \) tells us about how well dependent probabilities are approximated by independent ones, the probability of the bad event happening with the dependent data is at most \( \delta(\epsilon,\mu/2) + (\mu/2)\beta(a) \). We can make this as small as we like by letting \( \mu \) and \( a \) both grow as the time series gets longer. Basically, any result which holds for an IID process will also hold for a beta-mixing one, with a penalty in the probability that depends on \( \beta \). There are some details to fill in here (how to pick the separation \( a \)? should the blocks always be the same length as the "filler" between blocks?), but this is the basic frame.
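
The trade-off in choosing \( a \) can be seen in a toy calculation. Everything specific here is an assumption made for illustration: I stand in for \( \delta(\epsilon, m) \) with Hoeffding's bound for \( m \) IID block averages taking values in \( [0,1] \) (which covers a single fixed function, not a whole model class), and I pretend the mixing coefficients decay geometrically, \( \beta(a) = 0.9^a \).

```python
import math


def iid_tail_bound(eps, m):
    """Assumed stand-in for delta(eps, m): Hoeffding's bound for m IID block
    averages taking values in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2)


def assumed_beta(a, rho=0.9):
    """Hypothetical geometric mixing rate, beta(a) = rho ** a."""
    return rho ** a


def blocked_bound(n, a, eps):
    """delta(eps, mu/2) + (mu/2) * beta(a), with mu = n / a blocks."""
    half_mu = n / a / 2.0
    return iid_tail_bound(eps, half_mu) + half_mu * assumed_beta(a)


n, eps = 10_000, 0.2
bounds = {a: blocked_bound(n, a, eps) for a in range(1, 501)}
best_a = min(bounds, key=bounds.get)
print("best separation:", best_a, "bound:", round(bounds[best_a], 4))
```

Short blocks leave many of them, but they are far from independent, so the \( \beta \) penalty swamps everything; long blocks are nearly independent but there are too few of them for the IID bound to bite. The minimizing \( a \) sits in between.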

What it leaves open, however, is how to estimate the mixing coefficients \( \beta(h) \). For Markov models, one could in principle calculate them from the transition probabilities. For more general processes, though, calculating beta from the known distribution is not easy. In fact, we are not aware of any previous work on estimating the \( \beta(h) \) coefficients from observational data. (References welcome!) Because of this, even in learning theory, people have just assumed that the mixing coefficients were known, or that it was known they went to zero at a certain rate. This was not enough for what we wanted to do, which was actually calculate bounds on error from data.

There were two tricks to actually coming up with an estimator. The first was to reduce the ambitions a little bit. If you look at the equation for \( \beta(h) \) above, you'll see that it involves integrating over the infinite-dimensional distribution. This is daunting, so instead of looking at the whole past and future, we'll introduce a horizon, \( d \) steps away, and cut things off there: \[ \begin{eqnarray*} \beta^{(d)}(h) & \equiv & d_{TV}(P(X^t_{t-d}, X_{t+h}^{t+h+d}), P(X^t_{t-d}) \otimes P(X_{t+h}^{t+h+d})) \\ & = & \frac{1}{2}\int{|p(x^t_{t-d},x_{t+h}^{t+h+d})-p(x^t_{t-d})p(x^{t+h+d}_{t+h})|dx^t_{t-d}dx^{t+h+d}_{t+h}} \end{eqnarray*} \] If \( X \) is a Markov process, then there's no difference between \( \beta^{(d)}(h) \) and \( \beta(h) \). If \( X \) is a Markov process of order \( p \), then \( \beta^{(d)}(h) = \beta(h) \) once \( d \geq p \). If \( X \) is not Markov at any order, it is still the case that \( \beta^{(d)}(h) \rightarrow \beta(h) \) as \( d \) grows. So we have an approximation to \( \beta \) which only involves finite-dimensional integrals, which we might have some hope of doing.

The other trick is to get rid of those integrals. Another way of writing the beta-dependence between the random variables \( U \) and \( V \) is \[ \beta(U,V) = \sup_{\mathcal{A},\mathcal{B}}{\frac{1}{2}\sum_{a\in\mathcal{A}}{\sum_{b\in\mathcal{B}}{\left| \Pr{(a \cap b)} - \Pr{(a)}\Pr{(b)} \right|}}} \] where \( \mathcal{A} \) runs over finite partitions of values of \( U \), and \( \mathcal{B} \) likewise runs over finite partitions of values of \( V \). I won't try to show that this formula is equivalent to the earlier definition, but I will contend that if you think about how that integral gets cashed out as a sum, you can sort of see how it would be. If we want \( \beta^{(d)}(h) \), we can take \( U = X^{t}_{t-d} \) and \( V = X^{t+h+d}_{t+h} \), and we could find the dependence by taking the supremum over partitions of those two variables.

Now, suppose that the joint density \( p(x^t_{t-d},x_{t+h}^{t+h+d}) \) was piecewise constant, with those pieces being rectangles parallel to the coordinate axes. Then sub-dividing those rectangles would not change the sum, and the \( \sup \) would actually be attained for that particular partition. Most densities are not of course piecewise constant, but we can approximate them by such piecewise-constant functions, and make the approximation arbitrarily close (in total variation). More, we can estimate those piecewise-constant approximating densities from a time series. Those estimates are, simply, histograms, which are about the oldest form of density estimation. We show that histogram density estimates converge in total variation on the true densities, when the bin-width is allowed to shrink as we get more data.

Because the total variation distance is in fact a metric, we can use the triangle inequality to get an upper bound on the true beta coefficient, in terms of the beta coefficients of the estimated histograms, and the expected error of the histogram estimates. All of the error terms shrink to zero as the time series gets longer, so we end up with consistent estimates of \( \beta^{(d)}(h) \). That's enough if we have a Markov process, but in general we don't. So we can let \( d \) grow as \( n \) does, and that (after a surprisingly long measure-theoretic argument) turns out to do the job: our histogram estimates of \( \beta^{(d)}(h) \), with suitably-growing \( d \), converge on the true \( \beta(h) \).
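
A much-simplified sketch of the histogram idea, to show its shape rather than to reproduce the estimator in the papers: it compares single observations \( X_t \) and \( X_{t+h} \) rather than longer blocks, uses a fixed number of equal-width bins instead of a bin-width that shrinks with \( n \), and is tried on a simulated AR(1) series. All names and settings are assumptions of mine.

```python
import random
from collections import Counter


def bin_index(x, lo, hi, k):
    """Index of the equal-width bin (out of k) on [lo, hi] containing x."""
    if hi == lo:
        return 0
    return min(max(int((x - lo) / (hi - lo) * k), 0), k - 1)


def beta_hat(xs, h, k=5):
    """Plug-in estimate: half the sum over cells of |joint - product of marginals|."""
    if h < 1:
        raise ValueError("separation h must be at least 1")
    lo, hi = min(xs), max(xs)
    bins = [bin_index(x, lo, hi, k) for x in xs]
    pairs = list(zip(bins[:-h], bins[h:]))  # (bin of X_t, bin of X_{t+h})
    m = len(pairs)
    joint = Counter(pairs)
    past = Counter(a for a, _ in pairs)
    future = Counter(b for _, b in pairs)
    total = 0.0
    for a in range(k):
        for b in range(k):
            total += abs(joint[(a, b)] / m - past[a] / m * future[b] / m)
    return 0.5 * total


def simulate_ar1(n, phi, rng):
    """X_t = phi * X_{t-1} + Gaussian noise, started at zero."""
    xs, x = [], 0.0
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs


rng = random.Random(7)
ar1 = simulate_ar1(5000, 0.9, rng)
iid = [rng.gauss(0.0, 1.0) for _ in range(5000)]
est_near, est_far = beta_hat(ar1, 1), beta_hat(ar1, 50)
est_iid = beta_hat(iid, 1)
print("AR(1), h=1:", round(est_near, 3), " h=50:", round(est_far, 3))
print("IID,   h=1:", round(est_iid, 3))
```

Even for independent data the plug-in estimate comes out small but positive, since sampling noise in the cell counts can only add to the absolute differences. That is part of why the error terms of the histograms have to be controlled explicitly.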

To confirm that this works, the papers go through some simulation examples, where it's possible to cross-check our estimates. We can of course also do this for empirical time series. For instance, in his thesis Daniel took four standard macroeconomic time series for the US (GDP, consumption, investment, and hours worked, all de-trended in the usual way). This data goes back to 1948, and is measured four times a year, so there are 255 quarterly observations. Daniel estimated a \( \beta \) of 0.26 at one quarter's separation, \( \widehat{\beta}(2) = 0.15 \), \( \widehat{\beta}(3) = 0.02 \), and somewhere between 0 and 0.11 for \(\widehat{\beta}(4) \). (That last is a sign that we don't have enough data to go beyond \( h = 4 \).) Optimistically assuming no dependence beyond a year, one can calculate the effective number of independent data points, which is not 255 but 31. This has morals for macroeconomics which are worth dwelling on, but that will have to wait for another time. (Spoiler: \( \sqrt{\frac{1}{31}} \approx 0.18 \), and that's if you're lucky.)

It's inelegant to have to construct histograms when all we want is a single number, so it wouldn't surprise us if there were a slicker way of doing this. (For estimating mutual information, which is in many ways analogous, estimating the joint distribution as an intermediate step is neither necessary nor desirable.) But for now, we can do it, when we couldn't before.

Enigmas of Chance; Kith and Kin; Self-Centered

Posted by crshalizi at April 20, 2012 14:57 | permanent link

April 15, 2012

Graphical Causal Models (Advanced Data Analysis from an Elementary Point of View)

Probabilistic prediction is about passively selecting a sub-ensemble, leaving all the mechanisms in place, and seeing what turns up after applying that filter. Causal prediction is about actively producing a new ensemble, and seeing what would happen if something were to change ("counterfactuals"). Graphical causal models are a way of reasoning about causal prediction; their algebraic counterparts are structural equation models (generally nonlinear and non-Gaussian). The causal Markov property. Faithfulness. Performing causal prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules for linear models.

Reading: Notes, chapter 22

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 15, 2012 20:03 | permanent link

Exam: Is This Test Really Necessary? (Advanced Data Analysis from an Elementary Point of View)

In which the analysis of multivariate data is recursively applied.

Reading: Notes, assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 15, 2012 20:02 | permanent link

Graphical Models (Advanced Data Analysis from an Elementary Point of View)

Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth?

Reading: Notes, chapter 21

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 15, 2012 20:01 | permanent link

Mixture Models (Advanced Data Analysis from an Elementary Point of View)

From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry: planes again. Probabilistic clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.

Extended example: Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components.

Reading: Notes, chapter 20; mixture-examples.R

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 15, 2012 20:00 | permanent link

April 08, 2012

"Generalization Error Bounds for Time Series"

On Friday, my student Daniel McDonald, who I have been lucky enough to jointly advise with Mark Schervish, defeated the snake — that is, defended his thesis:

Generalization Error Bounds for Time Series
In this thesis, I derive generalization error bounds — bounds on the expected inaccuracy of the predictions — for time series forecasting models. These bounds allow forecasters to select among competing models, and to declare that, with high probability, their chosen model will perform well — without making strong assumptions about the data generating process or appealing to asymptotic theory. Expanding upon results from statistical learning theory, I demonstrate how these techniques can help time series forecasters to choose models which behave well under uncertainty. I also show how to estimate the beta-mixing coefficients for dependent data so that my results can be used empirically. I use the bound explicitly to evaluate different predictive models for the volatility of IBM stock and for a standard set of macroeconomic variables. Taken together my results show how to control the generalization error of time series models with fixed or growing memory.
PDF

I hope to have a follow-up post very soon about the substance of Daniel's work, which is part of our INET grant, but in the meanwhile: congratulations, Dr. McDonald!

Kith and Kin; Enigmas of Chance; The Dismal Science

Posted by crshalizi at April 08, 2012 17:25 | permanent link

April 06, 2012

On Refereeing a Manuscript for PNAS with Roughly a Hundred Hypothesis Tests

Five percent of the time, the cigar is just blowing smoke.

Learned Folly; Enigmas of Chance

Posted by crshalizi at April 06, 2012 01:03 | permanent link

April 04, 2012

On Academic Talks: Memory and Fear

Attention conservation notice: 2000 words of advice to larval academics, based on mere guesswork and ill-assimilated psychology.

It being the season for job-interview talks, student exam presentations, etc., the problems novices have with giving them are much on my mind. And since I find myself composing the same e-mail of advice over and over, why not write it out once and for all?

Once you understand the purpose of academic talks, it becomes clear that the two fundamental obstacles to giving good talks are memory and fear.

The point of an academic talk is to try to persuade your audience to agree with you about your research. This means that you need to raise a structure of argument in their minds, in less than an hour, using just your voice, your slides, and your body-language. Your audience, for its part, has no tools available to it but its ears, eyes, and mind. (Their phones do not, in this respect, help.)

This is a crazy way of trying to convey the intricacies of a complex argument. Without external aids like writing and reading, the mind of the East African Plains Ape has little ability to grasp, and more importantly to remember, new information. (The great psychologist George Miller estimated the number of pieces of information we can hold in short-term memory as "the magical number seven, plus or minus two", but this may if anything be an over-estimate.) Keeping in mind all the details of an academic argument would certainly exceed that slight capacity*. When you over-load your audience, they get confused and cranky, and they will either tune you out or avenge themselves on the obvious source of their discomfort, namely you.

Therefore, do not overload your audience, and do not even try to convey all the intricacies of a complex academic argument in your talk. The proper goal of an academic talk is to convey a reasonably persuasive sketch of your argument, so that your audience are better informed about the subject, get why they should care, and are usefully oriented to what you wrote if and when they decide to read your paper. In many ways a talk is really an extended oral abstract for your paper. (This is more effective if those who are interested can read your paper, at an open pre-print archive or at least on your website.) Success in this means keeping your audience's load low, and there are two big ways to do that: make it easier for them to remember what matters, and reduce what they have to remember.

People can remember things more easily if they have a scheme they can relate them to, which helps them appreciate their relevance. Your audience will come to the talk with various schemata; use them.

You can and should also help your audience build new schemata.

As for limiting the information the audience needs to remember, the main rule is to ask yourself "Do they need to know this to follow the argument?" and "Will they need to remember this later?" If they do not need to know it even for a moment, cut it. (Showing or telling them details, followed by "don't worry about the details", does not work.) If they will need to remember it later, emphasize it, and remind them when you need it.

To answer "Do they need to know this?" and "Will they have to recall this?", you need to be intimately familiar with the logic of your own talk. The ideal of such familiarity is to have that logic committed to memory — the logic, not some exact set of words. When you really understand it, when you grasp all the logical connections and see why everything that's necessary is needed, the argument can "carry you along" through the presentation, letting you compose appropriate words as you go, without rote memorization. This has many advantages, not least the ability to field questions.

As a corollary to limiting what the audience needs to remember, if you are using slides, their text should be (1) prompts for your exposition and your audience's memory, or (2) things which are just too hard to say, like equations**. (Do not, whatever you do, read aloud the text of your slides.) But whether spoken or on the slide, cut your talk down to the essentials. This requires you to know what is essential.

"But the lovely, no the divine, details!" you protest. "All those fine points I checked, all the intricate work I did, all the alternatives I ruled out? When do I get to talk about them?" To which there are several responses.

  1. The point of the talk is not to please you, by reminding yourself of what a badass you are, but to tell your audience something useful and interesting. (Note to graduate students: It is important that you internalize that you are, in fact, a badass, but it is also important that then you move on. Needing to have your ego stroked by random academics listening to talks is a sign that you have not yet reached this stage.) Unless something matters to your actual message, it really doesn't belong in the main body of the talk.
  2. You can stick an arbitrary amount of detail in the "I'm glad you asked that" slides, which go after the one which says "Thank you for your attention! Any questions?".
  3. You also can and should put all these details in your paper, and the people who really care, to whom it really matters, will go read your paper. Once again, think of an academic talk as an extended oral abstract.

To sum up on memory, then: successful academic talks persuade your audience of your argument. To do this, and not instead alienate your audience, you have to work with their capacities and prior knowledge, and not against them. Negatively, this means limiting the amount of information you expect them to retain. Positively, you need to use, and make, schemata which help them see the relevance of particulars. You can still give an awful talk this way (maybe your argument is incredibly bad), but you can hardly give a good talk without it.

The major consideration in crafting the content of your talk is your audience's memory. The major consideration for the delivery of the talk is your fear. (Your own memory is not so great, but you have of course internalized the schema for your own talk, and so you can re-generate it as you go, using your slides as prompts.) Public speaking, especially about something important to you, and to an audience whose opinion matters to you, is intimidating to many people. Fear makes you a worse public speaker; you mumble, you forget your place in the argument, you can't think on your feet, you project insecurity (possibly by over-compensating), etc. You do not need to become a great, fearless public speaker; you do need to be adequate at it. The three major routes to doing this, in my experience, are desensitization, dissociation, and deliberate acts.

Desensitization is simple: the more you do it, and emerge unscathed, the less fearful you will be. Practice giving your talks to safe but critical audiences. ("But critical" is key: you need them to tell you honestly what wasn't working well. [Something can always be done better.]) If you can't get a safe-but-critical audience, get an audience you don't care about (e.g., some random conference), and practice on them. Remind yourself, too, that while your talk may be a big deal for you, it's rarely a big deal for your audience.

Dissociation is about embracing being a performer on a stage: the audience's idea of you is already a fictional character, so play a character. It can, once again, be very liberating to separate the persona you're adopting for the talk from the person you actually are. If that seems unethical, go read The Presentation of Self in Everyday Life. An old-fashioned insistence that what really matters are the ideas, and not their merely human vessel, can also be helpful here.

Finally, deliberate actions are partly about communicating better, and partly about a fake-it-till-you-make-it assumption of confidence. (Some of these are culture-bound, so adjust as need be.) Project your voice to be heard through the room. (Don't be ashamed to use a microphone if need be.) Look at your audience (not your shoes or the screen), letting your eyes rove over them to gauge their facial expressions; don't be afraid to maintain eye contact, but keep moving on. Maintain a nearly-conversational speed of talking; avoid long pauses. When fielding questions, don't defer to senior people or impose on your juniors; re-phrase the question before answering, to make sure everyone gets it, and to give yourself time to think about your reply. And for the sake of all that's holy, speak to the audience, not to a screen.

At the outset, I said that the two great obstacles to giving a good talk are memory and fear. The converse is that if you truly understand your own argument, and you truly believe in it, you can convey it in a way which works with your audience's memory, and overcome your own fear. The sheer mechanics of presentation will come with practice, and you will have something worth presenting.

Further reading:

*: Some branches of the humanities and the social sciences have the horrible custom of reading an academic paper out loud, apparently on the theory that this way none of the details get glossed over. The only useful advice which can be given about this is "Don't!". Academic prose has many virtues, but it is simply not designed for oral communication. Moreover, all of your audience consists of people who are very good at reading such prose, and can certainly do so at least as fast as you can recite it. Having people recite their papers, or even prepared remarks written in the style of a paper, does nothing except waste an hour in the life of the speaker and the audience --- and none of us has hours to waste.

**: As a further corollary, and particularly important in statistics, big tables of numbers (e.g., regression coefficients) are pointless; and here "big" means "larger than 2x2".

Manual trackback: Rules of Reason; New APPS; Hacker News; paperpools; Nanopolitan; The Essence of Mathematics Is Its Freedom

Corrupting the Young

Posted by crshalizi at April 04, 2012 01:09 | permanent link

April 03, 2012

How the Recent Mammals Got Their Size Distribution (Advanced Data Analysis from an Elementary Point of View)

Homework 8: in which returning to paleontology gives us an excuse to work with simulations, and to compare distributions.

Assignment; MOM_data_full.txt

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 03, 2012 23:40 | permanent link

Red Brain, Blue Brain (Advanced Data Analysis from an Elementary Point of View)

Homework 8: in which we try to predict political orientation from bumps on the skull, or rather from the volume of brain regions determined by MRI and adjusted by (unknown) formulas.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 03, 2012 09:20 | permanent link

Factor Analysis (Advanced Data Analysis from an Elementary Point of View)

Adding noise to PCA to get a statistical model. The factor model, or linear regression with unobserved independent variables. Assumptions of the factor model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one factor models, more general correlation patterns for multiple factors. Our first look at latent variables and conditional independence. Geometrically, the factor model says the data cluster on some low-dimensional plane, plus noise moving them off the plane. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models.
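The claim that observables are correlated only through shared factors, and the tetrad equations that follow for one-factor models, can be checked by simulation. What follows is a minimal sketch, not the course's factors.R: the loadings, noise level, sample size and seed are all made-up values chosen for illustration. With one factor of unit variance, the covariance of X_i and X_j is w_i w_j, so cov(X1,X2) cov(X3,X4) and cov(X1,X3) cov(X2,X4) should agree up to sampling noise.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


rng = random.Random(7)              # fixed seed so the run is repeatable
loadings = [1.0, 0.8, 0.6, 0.9]     # hypothetical factor loadings w_1..w_4
noise_sd = 0.5                      # hypothetical noise standard deviation
n = 20000

# One-factor model: X_j = w_j * F + noise_j, with F never recorded.
observed = [[] for _ in loadings]
for _ in range(n):
    factor = rng.gauss(0.0, 1.0)
    for j, w in enumerate(loadings):
        observed[j].append(w * factor + rng.gauss(0.0, noise_sd))

x1, x2, x3, x4 = observed
cov_12 = covariance(x1, x2)
tetrad_difference = cov_12 * covariance(x3, x4) - covariance(x1, x3) * covariance(x2, x4)

print("cov(X1, X2):", round(cov_12, 3), "vs. w1 * w2 =", loadings[0] * loadings[1])
print("tetrad difference (should be near 0):", round(tetrad_difference, 4))
```

A tetrad difference far from zero, relative to its sampling noise, would be evidence against a single shared factor; a difference near zero does not prove the factor model, since other models produce the same correlation patterns.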

Reading: Notes, chapter 19; factors.R and sleep.txt

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 03, 2012 09:15 | permanent link

Principal Components Analysis (Advanced Data Analysis from an Elementary Point of View)

Principal components is the simplest, oldest and most robust of dimensionality-reduction techniques. It works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the projection of the data on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.
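To make the eigenvector step concrete, here is a small sketch under stated assumptions: two-dimensional synthetic data scattered around a known direction (0.6, 0.8), and power iteration standing in for a full eigendecomposition. None of this is taken from pca.R; the data, the seed and the helper names are hypothetical.

```python
import math
import random


def sample_covariance(rows):
    """Sample covariance matrix (n - 1 denominator) of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
            for j in range(d)]


def matvec(matrix, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def leading_component(cov, iterations=200):
    """Power iteration: (largest eigenvalue, unit eigenvector) of cov."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = matvec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
    return eigenvalue, v


rng = random.Random(1)
direction = (0.6, 0.8)   # hypothetical "true" line the data lie near
points = []
for _ in range(2000):
    t = rng.gauss(0.0, 5.0)                                   # position along the line
    points.append([t * direction[0] + rng.gauss(0.0, 0.5),    # plus small isotropic noise
                   t * direction[1] + rng.gauss(0.0, 0.5)])

cov = sample_covariance(points)
variance_along_pc1, pc1 = leading_component(cov)
share = variance_along_pc1 / (cov[0][0] + cov[1][1])

print("first principal component:", [round(x, 3) for x in pc1])
print("share of total variance captured:", round(share, 3))
```

The eigenvalue returned is exactly the variance of the data projected on to the first component, which is the sense in which "closest line" and "maximum projected variance" coincide. In practice one would call a linear-algebra library rather than hand-roll power iteration.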

Reading: Notes, chapter 18; pca.R, pca-examples.Rdata, and cars-fixed04.dat

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 03, 2012 09:10 | permanent link

Relative Distributions and Smooth Tests (Advanced Data Analysis from an Elementary Point of View)

Applying the right CDF to a continuous random variable makes it uniformly distributed. How do we test whether some variable is uniform? The smooth test idea, based on series expansions for the log density. Asymptotic theory of the smooth test. Choosing the basis functions for the test and its order. Smooth tests for non-uniform distributions through the transformation. Dealing with estimated parameters. Some examples. Non-parametric density estimation on [0,1]. Checking conditional distributions and calibration with smooth tests. The relative distribution idea: comparing whole distributions by seeing where one set of samples falls in another distribution. Relative density and its estimation. Illustrations of relative densities. Decomposing shifts in relative distributions.
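The opening claim is easy to check numerically. The sketch below is only an illustration, with a made-up exponential example: it pushes draws through the correct CDF and through a deliberately wrong one, then measures how far each result is from uniform using the Kolmogorov-Smirnov distance (a simpler stand-in here, not the smooth test itself).

```python
import math
import random


def exponential_cdf(x, rate):
    """CDF of the exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x)


def ks_distance_from_uniform(values):
    """Largest gap between the empirical CDF of values and the Uniform(0,1) CDF."""
    ordered = sorted(values)
    n = len(ordered)
    return max(max((i + 1) / n - u, u - i / n) for i, u in enumerate(ordered))


rng = random.Random(42)
rate = 2.0   # hypothetical true rate
draws = [rng.expovariate(rate) for _ in range(5000)]

# Transform with the right CDF, and with a mis-specified one (half the rate).
u_right = [exponential_cdf(x, rate) for x in draws]
u_wrong = [exponential_cdf(x, 0.5 * rate) for x in draws]

d_right = ks_distance_from_uniform(u_right)
d_wrong = ks_distance_from_uniform(u_wrong)

print("distance from uniform, right CDF:", round(d_right, 3))
print("distance from uniform, wrong CDF:", round(d_wrong, 3))
```

With the wrong rate the transformed values pile up near zero, and the distance should settle near 0.25, the theoretical maximum gap for this particular mis-specification; with the right CDF it shrinks towards zero as the sample grows.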

Reading: Notes, chapter 17

Optional reading: Bera and Ghosh, "Neyman's Smooth Test and Its Applications in Econometrics"; Handcock and Morris, "Relative Distribution Methods"

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 03, 2012 09:00 | permanent link

March 31, 2012

Books to Read While the Algae Grow in Your Fur, March 2012

Attention conservation notice: I have no taste.

Elizabeth Bear, Range of Ghosts
Suppose that a very good fantasy novelist --- someone who gets the attraction of heroic fantasy at a bone-deep level, and has a core conviction that everything valuable involves a painful sacrifice (with luck, only proportionately painful) --- decided to dig back to, or through, the roots of the genre in writers like Tolkien, Howard and Leiber. Suppose then that she also took inspiration from the medieval history of Central Asia, and especially from books like Grousset and The Tibetan Empire in Central Asia and Beckwith. We might then find ourselves with a fantasy trilogy opening with a war between the sons and the grandsons of the Great Khagan for dominion over the empire of the steppes. If we are fortunate (and we are), this will go on with: shamanistic sorcerers trained in isolated mountain kingdoms; hungry ghosts; butterflies of ill omen; the peculiar beauty of fertile valleys in the high desert and their towns; rocs and their taming; Baluchitherium; death cults building their fortresses on desert mesas; bipedal tiger-demons; haunted kurgans; lunar yet vivid landscapes; necromancers drawing all the peoples of the world into war; hair-breadth escapes; painful wounds; plausible dynastic politics; wrenching choices; remarkable horses; yurts; heroes who have learned "to speak the truth and to handle bow and arrow well"; many very different big skies; polyandry; colored salt from the roof of the world; a Song monk wandering the western lands on a mysterious errand; and (spoiler follows) n phefrq evat bs tbyq juvpu gheaf vgf jrnere vaivfvoyr.
I realize that some of the buttons this pushes for me are rather arcane, but honestly it has been years since I read a novel in this genre with the same enjoyment and the same "but how does the story go on?" feeling at the end.
(More spoilers: V guvax gur Pneevba Xvat vf zber be yrff Na Yhfuna/Ebuxfuna, juvyr Qhancngv vf Nggvyn, ohg V nz ernyyl abg fher nobhg rvgure vqragvsvpngvba, naq vg'f n avpr dhrfgvba jung rvgure jbhyq zrna. Naq: Grzhe, lbh gjb-gvzvat pnq.)
Author's self-presentations: 1, 2, 3
Orhan Pamuk, My Name Is Red
A re-read; fortunately I'd forgotten the solution to the murder mystery, which of course is not the point. The point is rather: art, memory, ambition, longing, melancholy, transience, eternity, tradition, style, individuality, imperfection, perspective, vision, blindness. And then of course there are the games following from a novel about the Ottoman heirs of the Old Masters of Herat trying to learn the methods of the "Frankish and Venetian masters", being written by a Turk who has obviously mastered the methods of masters like Calvino. (It is not clear to me if the story-within-a-story technique is really a nod to the 1001 Nights, or rather to The Castle of Crossed Destinies and Invisible Cities.)
I rather disliked the first book of Pamuk's I read, but this is wonderful, and you should go read it if you haven't.
Ideally, however, there would be an illuminated edition.
(The one thing I would change, and I realize this is petty so I put it in the end in small type, is the disquisition about the meaning of oral sex, which is not at all indecent but just inadvertently funny, like the worst bits of Updike.)
Clark Ashton Smith, Zothique
Mind candy. Smith was a talented early fantasy writer, overlapping with science fiction and horror, of the same vintage as Lovecraft, largely remembered, nowadays, as the latter's friend. This is a bit unfair. Smith didn't have the same power of vision that Lovecraft did, but he was a much better story-teller, and an actually-good stylist. (Smith was also much more pervy and directly influenced by turn-of-the-century decadence, which may be a feature or a bug, depending.) Zothique is a collection, edited by Lin Carter, of Smith's fantasy stories set in the far future, named after Earth's last continent, when the sun has dimmed and weird magics haunt a world full of ruins. If they weren't a direct inspiration for Vance's The Dying Earth (and so much else), I will spend a night reciting "The Empire of the Necromancers" in a Pittsburgh cemetery. These are good stories of their kind (ObDisclaimer: casually racist and misogynist author was casually racist and misogynist), but the truth is, Vance was just better.
This particular anthology is long out of print, but his complete stories are online.
I see that I bought my copy from Moe's Books in Berkeley on 23 April 1993; I am indeed a sloth.
Martha Wells, The Cloud Roads and The Serpent Sea
Mind candy, but, like all of Wells's books, very high quality mind candy. These are romances of caste and ecology among the social lizards. (More exactly, shape-shifting social lizard-men, who mercifully owe nothing to Anunnaki mythology.) I gather that there will be a third book in the series in 2013.
ObLinkage: Author's self-presentation for The Cloud Roads
Marius Iosifescu and Serban Grigorescu, Dependence with Complete Connections and Its Applications
Full-length review: Memories Fading to Infinity
--- I used to joke that my nightmares included giving a talk and having a member of the audience announce at the end, in a thick Eastern European accent, that everything I'd said was a special case of a theorem his adviser had published in the Proceedings of the Academy of Agro-Technical Sciences of Outer Yajanistan in 1962. (I am under no illusions about being funny.) Reading Dependence with Complete Connections is a bit like wandering into that joke. I have been dealing with chains with complete connections since my first paper, though for most of the time I didn't realize that's what they were. I can salve my pride by saying that the problems I'm interested in (e.g.: given a stochastic process find the smallest random system with complete connections which generates it, for a particular value of "small") are not the ones solved by the old masters of Bucharest. But when I think of all the times I've said "they're like HMMs, only not exactly", I feel very low.
Tony Judt with Timothy Snyder, Thinking the Twentieth Century
Partly Judt's autobiography, partly Judt and Snyder conversing about the intellectual history of the twentieth century, intellectuals in politics, and about how intellectuals ought to behave, and just how far that has been from our actual conduct. At least as presented here, Snyder's main contribution to the book was making it possible at all — a truly moving story, so I will just refer to Thinking the Twentieth Century as Judt's.
Little in this book will surprise those who have read Judt's previous books, especially Postwar and Reappraisals, but this is rather more concentrated and systematic than his other works, and perhaps more accessible than the vast Postwar. As I've said before, I find a lot of Judt's views sympathetic and generally well-argued, and his prescriptive ideas very attractive. I'm glad we have this.
But, following a proud tradition, I am going to mostly quibble with a minor point, or rather some absences. It was very striking this time just how Eurocentric and literary-ideological Judt's perspective on intellectual life was. Intellectuals are authors from Europe or North America who write on politics, morals, or the arts. Anything about the natural world, technology, or math is right out. (There is a partial exception as to math in favor of economists, but even then the only two treated at length, or even I think by name, are Keynes and Hayek, who are, of course, unusually non-mathematical economists. [Actually, the discussion of how Hayek's road-to-serfdom ideas relate to Austrian politics in the '30s is very interesting.]) The image of a Central European thinker with massive influence from the middle of the 20th century onwards is Martin Heidegger*, not John von Neumann; the image of an intellectual in politics is Léon Blum, not Jawaharlal Nehru. (Even if one only cares about intellectuals as ideologists, von Neumann has a lot of claim to our attention.) It is not at all obvious that this is the best way to look at the century which saw the dissolution of the European empires, or the enterprise of science and technology assuming such vast size and consequence. One could defend these choices of perspectives as selections ("I am interested in this corner of the whole panorama of human intellectual life") or as judgments ("this is the most important history, for such-and-such reasons"), but I don't think Judt (or Snyder) realizes that they are choices.
(I will say nothing about Judt's pronouncements on American feminism in the last chapter, out of respect for his memory.)
*: In saying this, I don't mean for a moment to suggest that Judt agrees with Heidegger about, well, anything.
D. R. Cox and Christl A. Donnelly, Principles of Applied Statistics
Full-length review in American Scientist.
Shorter me: There are two great traditions of applied statistics. One is what we now call "data mining" or (distastefully) "data science". The other aims at solving scientific problems, and is what Cox has been contributing to for longer than most working statisticians have been alive. Short of apprenticing oneself to a master of the art for a few years, there is no better introduction to how one translates between scientific questions and statistical problems.
Nota bene, D. R. Cox, the eminent real-world statistician, is not to be confused with the fictional Dr. Cox, despite what some search engines might suggest. Indeed, so far as I know, no one has ever suggested that the actual Prof. Cox is a "bastard-coated bastard with bastard filling".
Amanda Downum, The Kingdoms of Dust
Mind candy. Suppose that Lovecraft's "Colour out of Space" had been what happened to Iram, City of Pillars — what then? (Previous adventures of our heroine.)
Jack Knight and James Johnson, The Priority of Democracy: Political Consequences of Pragmatism
Full-length review: Dissent is the Health of the Democratic State
Shorter me: This is actually a very deep book about democracy, and why, exactly, it is so awesome, but it's written so obscurely it will have no impact, which is a shame.
Seanan McGuire, Discount Armageddon
Mind candy. Between the community of monsters living among us in the big city, the generations-old organization dealing with same, the heroine's day-job as a scantily-clad waitress at a dubiously-themed bar, and (most of all) the chapter headings, if this isn't descended from (high quality) Middleman fanfic, there's been a lot of lateral genetic transfer. This, to be clear, is a good thing.
ObLinkage: Author's self-presentation.
Lauren Willig, The Garden Intrigue
Mind candy. By this point my commitment to the series and characters is slightly disturbing.
Franklin E. Zimring, The Great American Crime Decline
A nicely written summary of the established facts about the huge and enduring decline in crime in America during the 1990s, as well as just how little we understand about its causes. (Zimring has some malicious, but entirely justified, fun at the expense of those who, in the early and even mid 1990s, confidently predicted a massive crime surge*.) The "usual suspects" --- demographics, unemployment, and locking people up** --- all pointed towards a decline in the crime rate, but nobody, before it happened, would have been able to predict the massive scale of the decline. Even retrospectively, it is hard to make these account for more than about half of the decline***. Causal theories advanced after the decline had become obvious have obvious selection-bias problems, and when one tries to cross-check them by looking at what they would imply for phenomena other than national crime totals, the results are not happy****. (I am a little surprised that no one has tried to argue for the benevolent influences of e-mail, first-person shooters and hypertext.) Going over these facts and the evidence for and against putative causes involves a lot of examination of data and methodological criticism, but Zimring is good at conveying this clearly.
Zimring puts a lot of stress on two aspects of the crime decline. First, crime rates fell much more in New York City than in the rest of the country --- roughly twice as much. Something had to be very different about New York, even compared to other large cities in the US. What is still stranger is that, as Zimring says, in many ways New York was much the same city in 2000 as it had been in 1990, when it had several times as much serious and violent crime --- there had been no vast social or moral change.
Second, and I find this even more interesting than the stuff about New York, the crime decline in the US was paralleled by a crime decline in Canada of similar timing and magnitude --- but not in other developed countries. Yet Canada had no massive surge of incarceration, no huge expansion of policing, not even the same sort of economic boom in the 1990s... In fact, the most astonishing thing for me in the whole book is Figure 5.23 on p. 132, showing US and Canadian murder rates tracking each other with eerie precision from 1961 to 2002. Unless this is the consequence of a massive exercise in juking the stats, no explanation which focuses on causes only working in either country can be very plausible.
The Great American Crime Decline was published in 2006. I'd love to read an update, but even so I learned a great deal from it, and recommend it.
(Thanks to M. R. for telling me about this book and lending me her copy.)
*: Considering the prominent place in this scare-mongering of scholars like James Q. Wilson and John "Super-predators" DiIulio, there is an interesting essay to be written about the selective skepticism of conservatives and neo-conservatives towards social science forecasting.
**: Except, of course, for the crimes those in prison perpetrate on each other, as they notoriously do.
***: The major demographic variable here is the number of adolescents and young adults, especially the number of teenage boys and young men. As Zimring nicely puts it, however, the fact that young males are always disproportionately likely to be criminals doesn't help us predict the number of crimes from the number of young males. To do that, we would have to know not just the crime rates among different demographic groups, but also be able to extrapolate those rates into the future. If anyone has figured out how to do that, they're not telling.
****: Zimring also has some nice, tart examples here of "the cross-sterilization of the social sciences" (a phrase he attributes to an unnamed judge), especially when it comes to economists --- i.e., Steve "Freakonomics" Levitt --- writing about crime and demographics.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Enigmas of Chance; The Progressive Forces Central Asia; The Collective Use and Evolution of Concepts; The Commonwealth of Letters; Writing for Antiquity; Commit a Social Science; Pleasures of Detection, Portraits of Crime

Posted by crshalizi at March 31, 2012 23:59 | permanent link

March 30, 2012

Pearl of Great Prize

From the all-too-small Department of Unambiguously Good Things Happening to People Who Thoroughly Deserve Them, Judea Pearl has won the Turing Award for 2011. As a long-time admirer*, I could not be more pleased, and would like to take this opportunity to recommend his "Causal Inference in Statistics" again.

*: I realize it edges into "I liked Feynman before he joined the Manhattan Project; the Williamsburg Project was edgier" territory, but I have very vivid memories of reading Probabilistic Reasoning in Intelligent Systems in the winter months of early 1999, and being correspondingly excited to hear that the first edition of Causality was coming out...

Enigmas of Chance; Constant Conjunction Necessary Connexion

Posted by crshalizi at March 30, 2012 15:30 | permanent link

March 29, 2012

Sparsity as Sorcery (Next Two Weeks at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about high-dimensional statistics and (2) will be in Pittsburgh over the next two weeks.

I am not sure how our distinguished speakers would feel at being called sorcerers, but since one of them is using sparsity to read minds, and the other to infer causation from correlation, it is hard to think of a more appropriate word.

Bin Yu, "Sparse Modeling: Unified Theory and Movie Reconstruction Based on Brain Signal"
Abstract: Information technology has enabled the collection of massive amounts of data in science, engineering, social science, finance and beyond. Statistics is the science of data and indispensable for extracting useful information from high-dimensional data. After broad successes of statistical machine learning on prediction through regularization, interpretability is gaining attention and sparsity is used as its proxy. With the virtues of both regularization and sparsity, L1 penalized Least Squares (e.g. Lasso) has been intensively studied by researchers from statistics, applied mathematics, and signal processing. Lasso is a special case of sparse modeling and has also been the focus on compressive sensing lately.
In this talk, I would like to cover both theory and practice of Lasso and its extensions. First, I will present an insightful unified analysis of M-estimation with decomposable penalties under sparse high dimensional statistical models. Second, I will present collaborative research with the Gallant Neuroscience Lab at Berkeley on understanding human visual pathway. In particular, I will show how we use non-linear sparse models (SPAM) to improve encoding and decoding results for the visual cortex area V1, and I will explain how Lasso and ridge methods enter our movie reconstruction algorithm from fMRI brain signals (dubbed by TIME Magazine as "mind-reading computers" and selected as one of its 50 Best Inventions of 2011).
Time and place: 4--5 pm on Monday, 2 April 2012, in Scaife Hall 125
Peter Bühlmann, "Predicting Causal Effects in High-Dimensional Settings"
Abstract: Understanding cause-effect relationships between variables is of great interest in many fields of science. An ambitious but highly desirable goal is to infer causal effects from observational data obtained by observing a system of interest without subjecting it to interventions. This would allow to circumvent severe experimental constraints or to substantially lower experimental costs. Our main motivation to study this goal comes from applications in biology.
We present recent progress for prediction of causal effects with direct implications on designing new intervention experiments, particularly for high-dimensional, sparse settings with thousands of variables but based on only a few dozens of observations. We highlight exciting possibilities and fundamental limitations. In view of the latter, statistical modeling needs to be complemented with experimental validations: we discuss this in the context of molecular biology for yeast (Saccharomyces cerevisiae) and the model plant Arabidopsis thaliana.
Time and place: 4--5 pm on Wednesday, 11 April 2012, in Scaife Hall 125

As always, the talks are free and open to the public; hecklers will, however, be turned into newts.

Enigmas of Chance; Minds, Brains, Neurons

Posted by crshalizi at March 29, 2012 13:10 | permanent link

March 26, 2012

Networks, Crowds, and More Networks (This Week at the Statistics and Machine Learning Seminars)

Attention conservation notice: Only of interest if you (1) care about statistical models of networks or collective information-processing, and (2) will be in Pittsburgh this week.

I am behind in posting my talk announcements:

Andrew Thomas, "Marginal-Additive Models and Processes for Network-Correlated Outcomes"
Abstract: A key promise of social networks is the ability to detect and model the correlation of personal attributes along the structure of the network, in either static or dynamic settings. The basis for most of these models, the Markov Random Field on a lattice, has several assumptions that may not be reflected in real network data, namely the assumptions that the process is stationary on the lattice, and that the ties in the model are correctly specified. Additionally, it is less than clear how correlation over longer distances on networks can be adequately specified under the lattice mechanism, given the assumption of a stationary process at work.
Based on concepts from generalized additive models and spatial/geostatistical methods, I introduce a class of models that is more robust to the failure of these assumptions, more flexible to different definitions of network distance, and more generally applicable to large-scale studies of network phenomena. I apply this method to outcomes from two large-scale social network studies to demonstrate its use and versatility.
Time and place: 4--5 pm on Monday, 26 March 2012, in Scaife Hall 125
Sewoong Oh, "Learning from the Wisdom of the Crowd: Efficient Algorithms and Fundamental Limits"
Abstract: This talk is on designing extremely efficient and provably order-optimal algorithms to extract meaningful information from societal data, the kind of data that comes from crowdsourcing platforms like Amazon Mechanical Turk, or recommendation systems like the Netflix Challenge dataset. Crowdsourcing systems, like Amazon Mechanical Turk, provide platforms where large-scale projects are broken into small tasks that are electronically distributed to numerous on-demand contributors. Because these low-paid workers can be unreliable, we need to devise schemes to increase confidence in our answers, typically by assigning each task multiple times and combining the answers in some way. I will present the first rigorous treatment of this problem, and provide both an optimal task assignment scheme (using a random graph) and an optimal inference algorithm (based on low-rank matrix approximation and belief propagation) for that task assignment. This approach significantly outperforms previous approaches and, in fact, is asymptotically order-optimal, which is established through comparisons to an oracle estimator. Another important problem in learning from the wisdom of the crowd is how to make product recommendations based on past user ratings. A common and effective way to model these user ratings datasets is to use low-rank matrices. In order to make recommendations, we need to predict the unknown entries of a ratings matrix. A natural approach is to find a low-rank matrix that best explains the observed entries. Motivated by this recommendation problem, my approach is to provide a general framework for recovering a low-rank matrix from partial observations. I will introduce a novel, efficient and provably order-optimal algorithm for this matrix completion problem. The optimality of this algorithm is established through a comparison to a minimax lower bound on what the best algorithm can do.
Time and place: 10--11 am on Wednesday, 28 March 2012, in Gates Hall 6115
Lise Getoor, "Collective Graph Identification"
Abstract: The importance of network analysis is growing across many domains, and is fundamental in understanding online social interactions, biological processes, communication, ecological, financial, transportation networks, and more. In most of these domains, the networks of interest are not directly observed, but must be inferred from noisy and incomplete data, data that was often generated for purposes other than scientific analysis. In this talk, I will introduce the problem of graph identification, the process of inferring the hidden network from noisy observational data. I will describe some of the component steps involved, and then I will describe a collective approach to graph identification, which interleaves the necessary steps in the accurate reconstruction of the network. Time permitting, I will also survey some of the work in my group on probabilistic databases, privacy, visual analytics, and active learning.
Time and place: 4:30--5:30 pm on Thursday, 29 March 2012, in Gates Hall 6115
As always, all talks are free and open to the public.

Enigmas of Chance; Networks; The Collective Use and Evolution of Concepts

Posted by crshalizi at March 26, 2012 10:00 | permanent link

March 21, 2012

Signs I Will Not Recommend Your Manuscript Be Published As Is (No. 891)

You are a theoretical physicist, trying to do data analysis, and "Such a Shande far de Goyim!" is all I can think after reading your manuscript. Even if it turns out we are playing out this touching scene (which never fails to bring tears to my eyes) — no.

(SMBC via Lost in Transcription)

Update: Thanks to reader R.K. for correcting my Yiddish.

Learned Folly; Physics; Enigmas of Chance; Complexity

Posted by crshalizi at March 21, 2012 11:49 | permanent link

March 20, 2012

Fun with Density Estimation (Advanced Data Analysis from an Elementary Point of View)

Homework 7: A little theory, a little methodology, a little data analysis: these keep growing young statisticians healthily balanced.

assignment, n90_pol.csv data

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 20, 2012 10:31 | permanent link

Simulation (Advanced Data Analysis from an Elementary Point of View)

Simulation: implementing the story encoded in the model, step by step, to produce something data-like. Stochastic models have random components and so require some random steps. Stochastic models specified through conditional distributions are simulated by chaining together random variables. How to generate random variables with specified distributions. Simulation shows us what a model predicts (expectations, higher moments, correlations, regression functions, sampling distributions); analytical probability calculations are short-cuts for exhaustive simulation. Simulation lets us check aspects of the model: does the data look like typical simulation output? if we repeat our exploratory analysis on the simulation output, do we get the same results? Simulation-based estimation: the method of simulated moments.
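To make "chaining together random variables" concrete, here is a minimal sketch in Python (not the course's R). The model is a hypothetical one picked only for illustration: X is exponential, drawn by the quantile transform of a uniform, and Y given X is Gaussian around a line. The function name, parameter values, and seed are all assumptions, not anything from the notes.

```python
import numpy as np

def simulate_model(n, a=1.0, b=2.0, rate=0.5, sigma=1.0, seed=0):
    """Simulate n draws of (X, Y) by chaining conditional distributions.

    Hypothetical model, for illustration only:
      X ~ Exponential(rate),  Y | X ~ Normal(a + b*X, sigma^2).
    """
    rng = np.random.default_rng(seed)
    # Step 1: draw X by applying the exponential quantile function to a uniform.
    u = rng.uniform(size=n)
    x = -np.log1p(-u) / rate
    # Step 2: draw Y from its conditional distribution given the simulated X.
    y = a + b * x + rng.normal(scale=sigma, size=n)
    return x, y

x, y = simulate_model(100_000)
simulated_mean_y = y.mean()
# Analytical short-cut for the same quantity: E[Y] = a + b*E[X], with E[X] = 1/rate.
analytic_mean_y = 1.0 + 2.0 / 0.5
print(simulated_mean_y, analytic_mean_y)
```

The simulated mean should sit close to the analytical one, which is the sense in which the probability calculation is a short-cut for exhaustive simulation.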

Reading: Notes, chapter 16; R

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 20, 2012 10:30 | permanent link

March 15, 2012

Milestones

My paper with Aaron Clauset and Mark Newman on power laws has just passed 1000 citations on Google Scholar, slightly ahead of schedule. (Actually, the accuracy of Aaron's prediction is a little creepy.)

I am spending the day reading over my student Daniel McDonald's dissertation draft. The calendar tells me that I was in the middle of writing up my own dissertation in mid-March 2001. But this is impossible, since I could swear that was just a few months ago at most, not eleven years.

Most significant of all, one of my questions has been answered by Guillaume the adaptationist goat.

Self-Centered; Kith and Kin

Posted by crshalizi at March 15, 2012 11:00 | permanent link

March 08, 2012

Density Estimation (Advanced Data Analysis from an Elementary Point of View)

The desirability of estimating not just conditional means, variances, etc., but whole distribution functions. Parametric maximum likelihood is a solution, if the parametric model is right. Histograms and empirical cumulative distribution functions are non-parametric ways of estimating the distribution: do they work? The Glivenko-Cantelli law on the convergence of empirical distribution functions, a.k.a. "the fundamental theorem of statistics". More on histograms: they converge on the right density, if bins keep shrinking but the number of samples per bin keeps growing. Kernel density estimation and its properties: convergence on the true density if the bandwidth shrinks at the right rate; superior performance to histograms; the curse of dimensionality again. An example with cross-country economic data. Kernels for discrete variables. Estimating conditional densities; another example with the OECD data. Some issues with likelihood, maximum likelihood, and non-parametric estimation.
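A minimal sketch of a one-dimensional kernel density estimate, in Python rather than the course's R. It assumes a Gaussian kernel and a normal-reference rule-of-thumb bandwidth (1.06 times the sample standard deviation times n to the power -1/5); both are illustrative choices, not necessarily what the notes use, and the simulated standard-Gaussian sample is a stand-in for real data.

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth):
    """Kernel density estimate on `grid`, using a Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every data point.
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels centered on the data points, rescaled by the bandwidth.
    return kernel.mean(axis=1) / bandwidth

rng = np.random.default_rng(1)
sample = rng.normal(size=2000)            # synthetic data, true density known
grid = np.linspace(-4.0, 4.0, 401)
# Rule-of-thumb bandwidth (an assumption of this sketch).
h = 1.06 * sample.std(ddof=1) * len(sample) ** (-1.0 / 5.0)
density = gaussian_kde_1d(sample, grid, h)
true_density = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
max_abs_error = np.max(np.abs(density - true_density))
total_mass = density.sum() * (grid[1] - grid[0])
print(h, max_abs_error, total_mass)
```

Shrinking the bandwidth too fast makes the estimate spiky, and shrinking it too slowly over-smooths, which is the trade-off behind "the bandwidth shrinks at the right rate".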

Reading: Notes, chapter 15

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 08, 2012 10:30 | permanent link

March 06, 2012

Multivariate Distributions (Advanced Data Analysis from an Elementary Point of View)

Reminders about multivariate distributions. The multivariate Gaussian distribution: definition, relation to the univariate or scalar Gaussian distribution; effect of linear transformations on the parameters; plotting probability density contours in two dimensions; using eigenvalues and eigenvectors to understand the geometry of multivariate Gaussians; conditional distributions in multivariate Gaussians and linear regression; computational aspects, specifically in R. General methods for estimating parametric distributional models in arbitrary dimensions: moment-matching and maximum likelihood; asymptotics of maximum likelihood; bootstrapping; model comparison by cross-validation and by likelihood ratio tests; goodness of fit by the random projection trick.
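The link between conditional distributions and linear regression can be checked numerically. The sketch below (Python, with made-up parameter values) uses the standard result that, for a bivariate Gaussian, X1 given X2 = x2 is Gaussian with mean mu1 + (s12/s22)(x2 - mu2) and variance s11 - s12^2/s22, then compares the implied slope with a least-squares fit to simulated draws.

```python
import numpy as np

def conditional_gaussian(mu, sigma, x2):
    """Mean, variance, and slope for X1 given X2 = x2 in a bivariate Gaussian."""
    mu1, mu2 = mu
    s11, s12, s22 = sigma[0, 0], sigma[0, 1], sigma[1, 1]
    slope = s12 / s22                       # population regression coefficient
    cond_mean = mu1 + slope * (x2 - mu2)
    cond_var = s11 - s12**2 / s22           # does not depend on x2
    return cond_mean, cond_var, slope

# Hypothetical parameters, chosen only for illustration.
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

rng = np.random.default_rng(2)
draws = rng.multivariate_normal(mu, sigma, size=100_000)
# Least-squares regression of X1 on X2; polyfit returns (slope, intercept).
ols_slope, ols_intercept = np.polyfit(draws[:, 1], draws[:, 0], 1)

cond_mean, cond_var, slope = conditional_gaussian(mu, sigma, x2=0.0)
print(slope, ols_slope, cond_mean, ols_intercept, cond_var)
```

The fitted intercept estimates the conditional mean at x2 = 0, so both fitted coefficients should land near their population counterparts.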

Reading: Notes, chapter 14

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 06, 2012 09:25 | permanent link

March 01, 2012

GLM and GAM Examples (Advanced Data Analysis from an Elementary Point of View)

Building a weather forecaster for Snoqualmie Falls, Wash., with logistic regression. Exploratory examination of the data. Predicting wet or dry days from the amount of precipitation the previous day. First logistic regression model. Finding predicted probabilities and confidence intervals for them. Comparison to spline smoothing and a generalized additive model. Model comparison test detects significant mis-specification. Re-specifying the model: dry days are special. The second logistic regression model and its comparison to the data. Checking the calibration of the second model.

Reading: Notes, second half of chapter 13; Faraway, chapters 6 and 7

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 01, 2012 10:30 | permanent link

February 29, 2012

Books to Read While the Algae Grow in Your Fur, February 2012

Attention conservation notice: I have no taste.

Jennifer Safrey, Tooth and Nail
Mind candy: portrait of the tooth fairy as amateur boxer (and pollster!).
Jennifer Crusie, Welcome to Temptation and Faking It
Mind candy. "First, get the mark to smile..."
Thomas W. Young, Silent Enemy
Mind candy; thriller based on the author's experiences as a US military cargo pilot. There was something oddly compelling about the self-consciousness about being part of a vast and complicated system of humans and machines. I can't decide if it's a rebuttal of ideas about technological alienation, an exemplification, or just an acceptance that this-is-what-it's-like.
Steven Berlin Johnson, Everything Bad Is Good for You: How Today's Popular Culture Is Actually Making Us Smarter
Johnson has a couple of theses in this book. (1) Popular culture products — by which he overwhelmingly means American popular culture — have become more cognitively complex and demanding over the post-war decades, though he puts most of his emphasis on what's happened since 1970, and especially since 1980. (2) Comparing products of equivalent levels of popularity and (in some sense) quality, popular culture is at least as complex as it's been since the invention of the mass audience, and quite likely much more complex. (It makes no sense to compare cheap schlock movies of 2000 to the artistic peaks of 1970; you need to compare schlock to schlock.) (3) This increase in the complexity of the culture surrounding us, and which we use to entertain ourselves, drives the Flynn Effect.
I find points (1) and (2) pretty convincing, though this is necessarily impressionistic. (The thing which actually gives me the most pause is Johnson's forthright admission that the features which make shows like The Sopranos or The Wire demanding — huge casts of characters with multiple interacting plot threads extending over multiple episodes, or even over many years — have been parts of soap operas since time out of mind.) His description of what it's like to actually play modern videogames, for instance, is remarkably persuasive.
Point (3), however, is something else again. The possibility he doesn't give enough attention to is that the causal arrow points the other way. Suppose, for whatever reason, the kinds of habit of thought Johnson is talking about have become more widely distributed. This increases the size of the potential audience which could enjoy complex entertainments — and, perhaps more importantly, shrinks the pool of those who wouldn't find simpler ones boring. Makers seeking audiences then shift accordingly, accommodating an exogenous change in the audience. (Analogously, pulp fiction didn't teach people to read; it was a response to the innovations of mass literacy and cheap printing.) One would have to look elsewhere for an explanation of the Flynn Effect — the cumulative impact of generations of soap-opera-consuming mothers? — but I suspect that's true anyway.
Anyway, recommended if you care about these issues, or just like unusually intelligent (not clever) cultural criticism.
Saladin Ahmed, Throne of the Crescent Moon
Mind candy. This review on Tor.com is pretty good, and the excerpt on Ahmed's website is representative.
ObLinkage: Ahmed's self-presentation.
John Billheimer, The Contrary Blues
Mind candy. First book in the mystery series where I started with no. 2, Highway Robbery. A bit shakier writing than that one, but still enjoyable.
Jianqing Fan and Qiwei Yao, Nonlinear Time Series: Nonparametric and Parametric Methods
Full-length review: Everyone Their Own Oracle.
Shorter me: Modern non-parametric time series methods are immensely more powerful than Good Old-Fashioned ARMA-mongering. Fan and Yao have provided a nice introduction to time series for people who know non-parametric statistics; I doubt it will be helpful to ARMA-mongers. Given a choice to serve one audience or the other, I'd have made the same choice, but it does contribute to science advancing, in this field, funeral by funeral.
Walter Jon Williams, The Fourth Wall
Sequel to This Is Not a Game and Deep State, in which Dagmar Shaw and co. take Hollywood; which is of course a cover for a much deeper game. (I think I can avoid spoilers in what follows; let's see.) Unlike the previous books, the viewpoint character is not Dagmar, but a former child actor, now adult and desperate to do anything to get back into the limelight. (In the first chapter we see just how desperate Sean is.) Sean, as the narrator, is intelligent but also profoundly indifferent to everything outside the little world of Hollywood movie-making, which leads to an interesting skewing of perspective on the events he witnesses. The reader who gets to the end will see what has gone before in a very different light.
While this makes it possible to enjoy this book without having read the previous ones (without "in our last thrilling episode" exposition), those of us who have read those books will realize that Sean is mis-understanding what he sees, and be tantalized by the sense that the story he is telling us is peripheral to something with much higher stakes. It's an interesting choice on Williams's part, but it did leave me wanting more Dagmar.
ObLinkage: Williams's self-presentation.
Jay Lake, Endurance
Sequel to Green, in which she continues her run-ins with the Powers That Be — as well as Powers That Were, and Powers Yet to Come. The ending promises a sequel, which I very much want.
(A propos of my complaining about the melanin-depletion of the cover of Green, Lake's preface here indicates that he's OK with that, which is his right. So I will switch to just complaining that Green, as depicted here, not only does not look five months pregnant, but will never show up on Women Fighters in Reasonable Armor.)

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Enigmas of Chance; Minds, Brains, and Neurons; The Commonwealth of Letters; Commit a Social Science

Posted by crshalizi at February 29, 2012 23:59 | permanent link

February 28, 2012

Exam 1: Diabetes (Advanced Data Analysis from an Elementary Point of View)

In which we practice our art upon the condition formerly known as juvenile diabetes.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 28, 2012 10:31 | permanent link

Generalized Linear Models and Generalized Additive Models (Advanced Data Analysis from an Elementary Point of View)

Iteratively re-weighted least squares for logistic regression re-examined: coping with nonlinear transformations and model-dependent heteroskedasticity. The common pattern of generalized linear models and IRWLS. Binomial and Poisson regression. The extension to generalized additive models.

Reading: Notes, first half of chapter 13; Faraway, section 3.1, chapter 6

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 28, 2012 10:30 | permanent link

February 23, 2012

Logistic Regression (Advanced Data Analysis from an Elementary Point of View)

Modeling conditional probabilities; using regression to model probabilities; transforming probabilities to work better with regression; the logistic regression model; maximum likelihood; numerical maximum likelihood by Newton's method and by iteratively re-weighted least squares; comparing logistic regression to logistic-additive models.
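For readers who want to see the iteratively re-weighted least squares step spelled out, here is a minimal Python sketch (the course itself uses R). The function name, the simulated data, and the "true" coefficients are all assumptions made for the example; each pass solves a weighted least-squares problem for a working response, which for logistic regression is the same as a Newton step on the log-likelihood.

```python
import numpy as np

def logistic_irls(x, y, n_iter=25, tol=1e-8):
    """Fit a logistic regression with intercept by IRLS."""
    X = np.column_stack([np.ones(len(x)), x])    # add the intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                           # linear predictor
        p = 1.0 / (1.0 + np.exp(-eta))           # fitted probabilities
        w = np.clip(p * (1.0 - p), 1e-12, None)  # weights, guarded against zero
        z = eta + (y - p) / w                    # working response
        XtW = X.T * w                            # X' W with W diagonal
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic data from a known logistic model (illustrative values only).
rng = np.random.default_rng(3)
x = rng.normal(size=5000)
true_beta = np.array([-0.5, 1.5])
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = rng.binomial(1, p_true)

beta_hat = logistic_irls(x, y)
print(beta_hat)
```

The weights p(1 - p) are where the model-dependent heteroskedasticity enters: observations with fitted probabilities near 0 or 1 have little variance and get little weight in the working regression.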

Reading: Notes, chapter 12; Faraway, chapter 2 (skipping sections 2.11 and 2.12)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 23, 2012 10:30 | permanent link

February 21, 2012

What Makes the Union Strong? (Advanced Data Analysis from an Elementary Point of View)

In which we examine the fate of the organized working class, by way of review for the midterm.

Assignment, strikes.csv data set

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 21, 2012 10:30 | permanent link

February 19, 2012

"From Data to Knowledge: Machine-Learning with Real-time & Streaming Applications" (Dept. of Signal Amplification)

Attention conservation notice: Intellectuals gathering in Berkeley to argue about "knowledge" and "revolution".

This looks like fun, and if I didn't have conflicting obligations I'd definitely be there.

From Data to Knowledge: Machine-Learning with Real-time & Streaming Applications

May 7-11 2012
On the Campus of the University of California, Berkeley

We are experiencing a revolution in the capacity to quickly collect and transport large amounts of data. Not only has this revolution changed the means by which we store and access this data, but has also caused a fundamental transformation in the methods and algorithms that we use to extract knowledge from data. In scientific fields as diverse as climatology, medical science, astrophysics, particle physics, computer vision, and computational finance, massive streaming data sets have sparked innovation in methodologies for knowledge discovery in data streams. Cutting-edge methodology for streaming data has come from a number of diverse directions, from on-line learning, randomized linear algebra and approximate methods, to distributed optimization methodology for cloud computing, to multi-class classification problems in the presence of noisy and spurious data.

This conference will bring together researchers from applied mathematics and several diverse scientific fields to discuss the current state of the art and open research questions in streaming data and real-time machine learning. The conference will be domain driven, with talks focusing on well-defined areas of application and describing the techniques and algorithms necessary to address the current and future challenges in the field.

Sessions will be accessible to a broad audience and will have a single track format with additional rooms for breakout sessions and posters. There will be no formal conference proceedings, but conference applicants are encouraged to submit an abstract and present a talk and/or poster.

See the conference page for submission details, schedules, etc.

Via conference organizer and CMU alumnus Joey Richards.

Enigmas of Chance; Signal Amplification

Posted by crshalizi at February 19, 2012 12:44 | permanent link

Talks Next Week

Attention conservation notice: Only of interest if you (1) like hearing people talk about statistics and machine learning, and (2) will be in Pittsburgh next week.

I have been remiss about advertising upcoming talks.

Mark Davenport, "To Adapt or Not To Adapt: The Power and Limits of Adaptivity for Sparse Estimation"
Abstract: In recent years, the fields of signal processing, statistical inference, and machine learning have come under mounting pressure to accommodate massive amounts of increasingly high-dimensional data. Despite extraordinary advances in computational power, the data produced in application areas such as imaging, remote surveillance, meteorology, genomics, and large scale network analysis continues to pose a number of challenges. Fortunately, in many cases these high-dimensional signals contain relatively little information compared to their ambient dimensionality. For example, signals can often be well-approximated as sparse in a known basis, as a matrix having low rank, or using a low-dimensional manifold or parametric model. Exploiting this structure is critical to any effort to extract information from such data.
In this talk I will overview some of my recent research on how to exploit such models to recover high-dimensional signals from as few observations as possible. Specifically, I will primarily focus on the problem of estimating a sparse vector from a small number of noisy measurements. To begin, I will consider the case where the measurements are acquired in a nonadaptive fashion. I will establish a lower bound on the minimax mean-squared error of the recovered vector which very nearly matches the performance of $\ell_1$-minimization techniques, and hence shows that these techniques are essentially optimal. I will then consider the case where the measurements are acquired sequentially in an adaptive manner. I will prove a lower bound that shows that, surprisingly, adaptivity does not allow for substantial improvement over standard nonadaptive techniques in terms of the minimax MSE. Nonetheless, I will also show that there are important regimes where the benefits of adaptivity are clear and overwhelming.
Time and place: 4--5 pm on Monday, 20 February 2012, in Scaife Hall 125
Ambuj Tewari, "From Probabilistic to Game Theoretic Foundations for Learning and Prediction"
Abstract: The probabilistic approach to prediction problems assumes that the data is generated from an underlying stochastic process. A reasonable goal then is to minimize the expected loss, or risk. The game theoretic approach, in contrast, views prediction as a repeated game between the learner and an adversary. The learner's goal then is to do well no matter what strategy is followed by the adversary. Minimizing regret is one of the well known ways to operationalize the notion of doing well. With a long history in varied disciplines such as Computer Science, Economics, Information Theory, and Statistics, the game theoretic approach has witnessed a vigorous development. Yet the suite of standard tools available for the probabilistic setting, such as Rademacher & Gaussian averages, covering numbers, and combinatorial dimensions, was missing in the game theoretic setting. In this talk, I will show how it is indeed possible to develop analogues of these tools for the game theoretic setting. Unlike the probabilistic setting, where empirical risk minimization is a canonical algorithm, we will not be able to exhibit a corresponding canonical algorithm for the game theoretic setting. However, under the additional assumption of convexity, I will show that Mirror Descent, a classic algorithm from optimization theory, is a canonical algorithm achieving minimax regret rates.
(Talk is based on papers written jointly with Alexander Rakhlin, Nathan Srebro, and Karthik Sridharan.)
Time and place: 10--11 am on Wednesday, 22 February 2012, in Gates Hall 6115
Forrest W. Crawford, "Birth, Death, Sex, Lies: Markov Counting Processes in Genetics and Beyond"
Abstract: A general birth-death process (BDP) is a continuous-time Markov chain that counts the number of particles in a system over time. At any moment in time, a particle may give birth or die, and the rate at which these events occur depends on the number of particles in the system at that time. While widely used in population biology, genetics, and evolution, statistical inference techniques for general BDPs remain elusive. In fact, the likelihood of a discrete observation from many of these processes cannot be written in closed form. In this talk, I outline several fundamental results that allow computation of transition probabilities and maximum likelihood estimates for general BDPs. I apply these novel methods to three important applied problems. First, I describe a technique for determining the effect of antibody treatment on the growth of lymphoma cells in vitro. Second, I investigate the evolution of DNA microsatellites in humans and chimpanzees using a log-linear model for the rates of repeat duplication and deletion. Finally, I use a BDP to infer true counts of sex acts from rounded self-reported counts in a longitudinal study of risky behaviors in young people living with HIV. These applications illustrate the mathematical, statistical, and computational challenges involved in learning from BDPs in biology, medicine, and public health.
Time and place: 4--5 pm on Wednesday, 22 February 2012, in Scaife Hall 125
Ron Bekkerman, "Scaling Up Machine Learning"
Abstract: In this talk, I'll provide an extensive introduction to parallel and distributed machine learning. I'll answer the questions "How actually big is the big data?", "How much training data is enough?", "What do we do if we don't have enough training data?", "What are platform choices for parallel learning?" etc. Over an example of k-means clustering, I'll discuss pros and cons of machine learning in Apache Pig, MPI, DryadLINQ, and CUDA. Time permitting, I'll take a dive into a super large scale text categorization task.
Time and place: 1:30--2:30 pm on Thursday, 23 February 2012, in Newell-Simon Hall 1305

As always, the talks are free and open to the public.

(You see why I have trouble keeping up with these.)

Enigmas of Chance

Posted by crshalizi at February 19, 2012 12:30 | permanent link

February 15, 2012

How the North American Mammalian Paleofauna Got a Crook in Its Curve (Advanced Data Analysis from an Elementary Point of View)

In which extinct charismatic megafauna give us an excuse to practice basic programming, bootstrapping, and specification testing.

Assignment, R

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 15, 2012 14:15 | permanent link

Testing Regression Specifications (Advanced Data Analysis from an Elementary Point of View)

Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using bootstrapping from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
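A rough sketch of the in-sample version of such a test, in Python rather than the course's R. Everything specific here is an assumption made for illustration: the smoother is a Nadaraya-Watson kernel regression with a fixed bandwidth (not chosen by cross-validation), the parametric model is a straight line, the bootstrap draws Gaussian noise with the estimated residual standard deviation, and the data are synthetic.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line; returns (intercept, slope)."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

def kernel_smooth(x, y, h):
    """Nadaraya-Watson fitted values at the sample points (Gaussian kernel)."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

def mse_gap(x, y, h):
    """Test statistic: in-sample MSE of the line minus that of the smoother."""
    a, b = fit_line(x, y)
    mse_lin = np.mean((y - (a + b * x)) ** 2)
    mse_np = np.mean((y - kernel_smooth(x, y, h)) ** 2)
    return mse_lin - mse_np

def spec_test(x, y, h=0.3, n_boot=200, seed=0):
    """Parametric-bootstrap p-value for the linear specification."""
    rng = np.random.default_rng(seed)
    a, b = fit_line(x, y)
    sigma = np.std(y - (a + b * x), ddof=2)
    observed = mse_gap(x, y, h)
    null = np.empty(n_boot)
    for i in range(n_boot):
        # Simulate from the fitted parametric model, then recompute the gap.
        y_star = a + b * x + rng.normal(scale=sigma, size=len(x))
        null[i] = mse_gap(x, y_star, h)
    p_value = (1 + np.sum(null >= observed)) / (1 + n_boot)
    return observed, p_value

rng = np.random.default_rng(4)
x = rng.uniform(-2.0, 2.0, size=200)
y_linear = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=200)   # line is correct
y_curved = np.sin(2.0 * x) + rng.normal(scale=0.3, size=200) # line is wrong

gap_linear, p_linear = spec_test(x, y_linear)
gap_curved, p_curved = spec_test(x, y_curved)
print(p_linear, p_curved)
```

The smoother will nearly always fit a little better in-sample than the line, even when the line is right; the bootstrap is what tells us how much better is too much.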

Reading: Notes, chapter 10

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 15, 2012 14:10 | permanent link

Writing R Code (Advanced Data Analysis from an Elementary Point of View)

A change to the lecture schedule, by popular demand!

R programs are built around functions: pieces of code that take inputs or arguments, do calculations on them, and give back outputs or return values. The most basic use of a function is to encapsulate something we've done in the terminal, so we can repeat it, or make it more flexible. To assure ourselves that the function does what we want it to do, we subject it to sanity-checks, or "write tests". To make functions more flexible, we use control structures, so that the calculation done, and not just the result, depends on the argument. R functions can call other functions; this lets us break complex problems into simpler steps, passing partial results between functions. Programs inevitably have bugs: debugging is the cycle of figuring out what the bug is, finding where it is in your code, and fixing it. Good programming habits make debugging easier, as do some tricks. Avoiding iteration. Re-writing code to avoid mistakes and confusion, to be clearer, and to be more flexible.

Reading: Notes, chapter 9

Optional reading: Slides from 36-350, introduction to statistical computing, especially through lecture 15.

R for in-class demos (based around the previous problem set)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 15, 2012 14:05 | permanent link

Cozy Catastrophes

Attention conservation notice: Academics with blogs quibbling about obscure corners of applied statistics.

Lurkers in e-mail point me to this pushback against the general pushback against power laws, and ask me to comment. It might be a mistake to do so, but I'm feeling under the weather and so splenetic, so I will.

In our paper, we looked at 24 quantities which people claimed showed power law distributions. Of these, there were seven cases where we could flat-out reject a power law, without even having to consider an alternative, because the departures of the actual distribution from even the best-fitting power law were much too large to be explained away as fluctuations. (One of the wonderful things about a stochastic model is that it tells you how big its own errors should be.) In contrast, there was only one data set where we could rule out the log-normal distribution.

In some of those cases, you can patch things up, sort of, by replacing a pure power law with a power-law with an exponential cut-off. That is, rather than the probability density being proportional to $x^{-a}$, it's proportional to $x^{-a} e^{-x/L}$. (Either way, I am only talking about the probability density in the "right tail", i.e., for x above some $x_{\min}$.) This gives the infamous straight-ish patch on a log-log plot, for values of x much smaller than L, but otherwise it has substantially different properties. In ten of the twenty-four cases we looked at, the only way to save the idea of a power-law at all is to include this exponential cut-off. But that exponentially-shrinking factor is precisely what squelches the WTF, X IS ELEVENTY TIMES LARGER THAN EVER! THE BIG ONE IS IN OUR BASE KILLING OUR DOODZ!!!!1!! mega-events. There were ten more cases where we judged the support for power laws as "moderate", meaning "the power law is a good fit but that there are other plausible alternatives as well" (pardon the self-quotation.) Again, those alternatives, like log-normals and stretched exponentials, give very different tail-behavior, with not so much OMG DOOM.
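To see how much work the cut-off does, here is a small Python sketch comparing the two tail shapes. The exponent a and the cut-off scale L are hypothetical values, and normalizing constants are ignored, so the "suppression" below is the ratio of the two densities only up to a constant factor.

```python
import numpy as np

def log_density_power_law(x, a):
    """Unnormalized log density of a pure power law, x^(-a)."""
    return -a * np.log(x)

def log_density_cutoff(x, a, L):
    """Unnormalized log density of a power law with exponential cut-off."""
    return -a * np.log(x) - x / L

a, L = 2.5, 100.0                      # hypothetical values, for illustration
x = np.logspace(0, 3, 301)             # from 1 to 1000, even in log x

# Factor by which the cut-off shrinks the density relative to the pure power law.
suppression = np.exp(log_density_cutoff(x, a, L) - log_density_power_law(x, a))

# Local slope on a log-log plot, d log p / d log x, by finite differences.
slope_pure = np.gradient(log_density_power_law(x, a), np.log(x))
slope_cutoff = np.gradient(log_density_cutoff(x, a, L), np.log(x))

for i in (100, 200, 300):              # x = 10, 100, 1000
    print(x[i], suppression[i], slope_pure[i], slope_cutoff[i])
```

For x well below L the two slopes are nearly equal, which is the straight-ish patch; once x passes L the cut-off slope steepens without limit and the suppression factor collapses, which is why the mega-events go away.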

We found exactly one case where the statistical evidence for the power-law was "good", meaning that "the power law is a good fit and that none of the alternatives considered is plausible", which was Zipf's law of word frequency distributions. We were of course aware that when people claim there are power laws, they usually only mean that the tail follows a power law. This is why all these comparisons were about how well the different distributions fit the tail, excluding the body of the data. We even selected where "the tail" begins to maximize the fit to a power law for each case. Even so, there was just this one case where the data compellingly support a power law tail.

(All of this — the meaning of "with cut-off", the meaning of our categorizations, the fact that we only compare the tails, etc. — is clear enough from our paper, if you actually read the text. Or even just the tables and their captions.)

I bring up the OMG DOOM because some people, Hanson very much included, like to extrapolate from supposed power laws for various Bad Things to scenarios where THE BIG ONE kills off most of humanity. But, at least with the data we found, the magnitudes of forest fires, solar flares, earthquakes and wars were all better fit by log-normals, by stretched exponentials and by cut-off power laws than by power laws. For fires, flares and quakes, the differences are large enough that they clearly fall into the "with cut-off only" category. The differences in fits for the war-death data are smaller, as (mercifully) is the sample size, so we put it in the "moderate" support category. If you had some compelling other reason to insist on a power law rather than (e.g.) a log-normal there, the data wouldn't slap you down, but they wouldn't back you up either.

Now, I relish the schadenfreude-laden flavors of a mega-disaster scenario as much as the next misanthropic, science-fiction-loving geek, especially when it's paired with some "The fools! Can't they follow simple math?" on the side. Truly, I do. But squeezing that savory, juicy DOOM out of (for instance) the distribution of solar flares relies on the shape of the tail, i.e., whether it's a pure power law or not. The weak support, in the data, for such power laws means you don't really have empirical evidence for your scenarios, and in some cases what evidence there is tells against them. It's a free country, so you can go on telling those stories, but don't pretend that they owe more to confronting hard truths than to literary traditions.

Power Laws

Posted by crshalizi at February 15, 2012 14:00 | permanent link

February 13, 2012

Of Variance Explained; or, Chronicles of Deaths Smoothed

Attention conservation notice: 1500 word pedagogical-statistical rant, with sarcasm, mathematical symbols, computer code, and a morally dubious affectation of detachment from the human suffering behind the numbers. Plus the pictures are boring.
Does anyone know when the correlation coefficient is useful, as opposed to when it is used? If so, why not tell us?
— Tukey (1954: 721)

If you have taken any sort of statistics class at all, you have probably been exposed to the idea of the "proportion of variance explained" by a regression, conventionally written $R^2$. This has two definitions, which happen to coincide for linear models fit by least squares. The first is to take the correlation between the model's predictions and the actual values ($R$) and square it ($R^2$), getting a number which is guaranteed to be between 0 and 1. You get 1 only when the predictions are perfectly correlated with reality, and 0 when there is no linear relationship between them. The other definition is the ratio of the variance of the predictions to the variance of the actual values. It is this latter which leads to the notion that $R^2$ is the proportion of variance explained by the model.

The use of the word "explained" here is quite unsupported and often actively misleading. Let me go over some examples to indicate why.

Start by supposing that a linear model is true:

Y = a + bX + noise
where the noise has constant variance s, and is uncorrelated with X. Suppose that we know this is the model to use, and suppose further that, as a reward for our scrupulous peer-review of anonymous manuscripts, the Good Fairy of Statistical Modeling tells us the correct values of the parameters a and b. Surely, with the right parameters in the right model, our $R^2$ must be very high?

Well, no. The answer depends on the variance of X, which it will be convenient to call v. The variance of the predictions is $b^2 v$, but the variance of Y is larger, $b^2 v + s$. The ratio is \[ R^2 = \frac{b^2 v}{b^2v + s} \] (You can check that this is also the squared correlation between the predictions and Y.) As v shrinks, this tends to 0/s = 0. As v grows, this tends to 1. The relationship between X and Y doesn't change, the accuracy and precision with which Y can be predicted from X do not change, but $R^2$ can wander all through its range, just depending on how dispersed X is.
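The algebra is easy to check by simulation. The sketch below, in Python, assumes Gaussian X and Gaussian noise (neither is needed for the formula) and uses arbitrary illustrative values for a, b and s. It predicts with the true parameters and computes R-squared both ways for three values of v.

```python
import numpy as np

A, B, S = 1.0, 2.0, 1.0    # illustrative intercept, slope, and noise variance

def r_squared_true_model(v, n=200_000, seed=0):
    """Both versions of R^2 when predicting with the true a and b."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=np.sqrt(v), size=n)             # Var(X) = v
    y = A + B * x + rng.normal(scale=np.sqrt(S), size=n)
    pred = A + B * x                                      # the Good Fairy's model
    r2_corr = np.corrcoef(pred, y)[0, 1] ** 2             # squared correlation
    r2_var = pred.var() / y.var()                         # ratio of variances
    return r2_corr, r2_var

results = {}
for v in (0.01, 1.0, 100.0):
    theory = B**2 * v / (B**2 * v + S)
    results[v] = (theory, *r_squared_true_model(v))
    print(v, results[v])
```

The slope, the noise level, and hence the size of a typical prediction error are identical across the three runs; only the spread of X changes, and with it the R-squared.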

Now, you say, this is a silly algebraic curiosity. Never mind the Good Fairy of Statistical Modeling handing us the correct parameters, let's talk about something gritty and real, like death in Chicago.

Number of deaths each day in Chicago, 1 January 1987--31 December 2000, from all causes except accidents. (See below for link to code.)

I can relate deaths to time in any number of ways; the next figure shows what I get when I use a smoothing spline (and use cross-validation to pick how much smoothing to do). The statistical model is

death = f0(date) + noise
with f0 being a function learned from the data.
As before, but with the addition of a smoothing spline.

The root-mean-square error of the smoothing spline is just above 12 deaths/day. The R^2 of the fit is either 0.35 (squared correlation between predicted and actual deaths) or 0.33 (variance of predicted deaths over variance of actual deaths). It seems absurd, however, to say that the date explains how many people died in Chicago on a given day, or even the variation from day to day. The closest I can come up with to an example of someone making such a claim would be an astrologer, and even one of them would work in some patter about the planets and their influences. (Numerologists, maybe? I dunno.)

Worse is to follow. The same data set which gives me these values for Chicago includes other variables, such as the concentration of various atmospheric pollutants and temperature. I can fit an additive model, which tries to tease out the separate relationships between each of those variables and deaths in Chicago, without presuming a particular functional form for each relationship. In particular I can try the model

deaths = f1(sulfur dioxide) + f2(particulates) + f3(temperature, ozone) + noise
where the functions f1, f2 and f3 are all learned from data. (Exercise: why do I do a joint smoothing against temperature and ozone?) When I do that, I get functions which look like the following.
Estimated partial response functions for concentration of sulfur dioxide, concentration of particulates, and (jointly) temperature and concentration of ozone, all taken as averages over four-day moving windows.

The R^2 of this model is 0.27. Is this "variance explained"? Well, it's at least not incomprehensible to talk about changes in temperature or pollution explaining changes in mortality. In fact, adding this model's predictions to the simple spline's, we see that most of what the spline predicted from the date is predictable from pollution and temperature:

Black dots: actual death counts. Red curve: spline smoothing on the date alone. Blue lines: predictions from the temperature-and-pollution model.
But notice it is not anything in the math or the statistics which tells us that this is a step closer to something we might, unblushingly, call an "explanation". The astrologer, after all, could look at this figure the other way, and say that really pollution and temperature are just crude proxies for the position of Mars (or whatever).

We could, in fact, try to include the date in this larger model:

deaths = f0(date) + f1(sulfur dioxide) + f2(particulates) + f3(temperature, ozone) + noise
Of course, we have to re-estimate all the functions, but as it turns out they don't change very much. (I'd show you the plot of the fitted values over time as well, but visually it's almost indistinguishable from the last one.)

Despite the lack of visual drama, putting a smooth function of time back into the model increases R^2, from 0.27 to 0.30. Formally, the date enters into the model in exactly the same way as particulate pollution. But, again, only a fortune teller (an unusually numerate fortune teller, perhaps a subscriber to the Journal of Evidence-Based Haruspicy) would say that the date explains, or helps explain, 3% of the variance.

I hope that by this point you will at least hesitate to think or talk about R^2 as "the proportion of variance explained". (I will not insist on your never talking that way, because you might need to speak to the deluded in terms they understand.) How then should you think about it? I would suggest: the proportion of variance retained, or just kept, by the predictions. Linear regression is a smoothing method. (It just smooths everything on to a line, or more generally a hyperplane.) It's hard for any smoother to give fitted values which have more variance than the variable it is smoothing. R^2 is merely the fraction of the target's variance which is not smoothed away.

This of course raises the question of why you'd care about this number at all. If prediction is your goal, then it would seem much more natural to look at mean squared error. (Or really root mean squared error, so it's in the same units as the variable predicted.) Or mean absolute error. Or median absolute error. Or a genuine loss function. If on the other hand you want to get some function right, then your question is really about mis-specification, and/or confidence sets of functions, and not about whether your smoother is following every last wiggle of the data at all. If you want an explanation, the fact that there is a peak in deaths every year of about the same height, but the predictions fall short of it, suggests that this model is missing something. The fact that the data shows something awful happened in 1995 and the model has nothing adequate to say about it suggests that whatever's missing is very important.
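The error measures named above are each a line or two of code. A minimal sketch in standard-library Python, with made-up numbers (four days of counts, one of them badly mis-predicted) to show how differently they react to a single large miss:

```python
import math
import statistics

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the variable predicted."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def median_abs_error(actual, predicted):
    return statistics.median(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical values, for illustration only.
actual = [10, 12, 9, 30]
predicted = [11, 11, 10, 12]

print(rmse(actual, predicted))              # dominated by the one miss of 18
print(mean_abs_error(actual, predicted))    # 5.25
print(median_abs_error(actual, predicted))  # 1.0, blind to the one big miss
```

Which of these is appropriate depends on what the errors cost you, which is the point of asking for a genuine loss function.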

Code for reproducing the figures and analyses in R. (I make this public, despite the similarity of this exercise to the last problem-set in advanced data analysis, because (i) it's not exactly the same, (ii) the homework is due in ten hours, (iii) none of my students would dream of copying this and turning it in as their own, and (iv) I borrowed the example from Simon Wood's Generalized Additive Models.)

Manual trackback: Bob O'Hara; Siris

Enigmas of Chance

Posted by crshalizi at February 13, 2012 23:54 | permanent link

Power Law News

1. I'd like to say that you have no idea how long I have waited to read something like this piece by Michael Stumpf and Mason Porter in one of the glossy journals. But that would be a lie, because if you've been reading this for any length of time, you know that the answer is, long enough to be very tiresome about it. If the referees, and still more the editors, at those journals can be persuaded to pay attention, we will be on track for my mid-2007 hope that "in five to ten years even science journalists and editors of Wired will begin to get the message." (I never really had any hopes for Wired.)

2. You can imagine how my heart sank to see that Krugman had a post titled "The Power (Law) of Twitter" — and my relief to see that he's not actually saying that the distribution of followers is a power law. It is however interesting that the distribution is so close to a log-normal.

3. My ex-boss and mentor Melanie Mitchell has a blog, and promises a substantive series of posts on power laws and scaling. In the meanwhile, go read her book.

Update, 15 February: see later post.

Manual trackback: Brendan O'Connor

(Nos. 1 and 2 via too many to list.)

Power Laws

Posted by crshalizi at February 13, 2012 20:40 | permanent link

February 09, 2012

Additive Models (Advanced Data Analysis from an Elementary Point of View)

The "curse of dimensionality" limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimensionality. Parametric models do not have this problem, but have bias and do not let us discover anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example, where each input variable has a "partial response function", which add together to get the total regression function; the partial response functions are unconstrained. This generalizes linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. Examples in R using the California house-price data. Conclusion: there are no statistical reasons to prefer linear models to additive models, hardly any scientific reasons, and increasingly few computational ones; the continued thoughtless use of linear regression is a scandal.
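The iterative fitting described above (backfitting) can be sketched in a few dozen lines. This is an illustration in standard-library Python, not the R code from the notes. The smoother is a crude running mean chosen for brevity, the loop runs a fixed number of passes instead of checking for convergence, and the two-variable test function is an assumption made up for the example.

```python
import math
import random

def running_mean(x, r, k=15):
    """Smooth r against x: average r over the k points nearest in x-rank."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    half = k // 2
    fitted = [0.0] * n
    for pos, i in enumerate(order):
        window = order[max(0, pos - half):min(n, pos + half + 1)]
        fitted[i] = sum(r[j] for j in window) / len(window)
    return fitted

def backfit(columns, y, n_iter=20, k=15):
    """Fit y = alpha + sum_j f_j(x_j) by cycling one-dimensional smooths."""
    n, p = len(y), len(columns)
    alpha = sum(y) / n
    partials = [[0.0] * n for _ in range(p)]   # initial guess: every f_j is zero
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: what is left of y after removing the
            # intercept and every other variable's current guess.
            resid = [y[i] - alpha - sum(partials[m][i] for m in range(p) if m != j)
                     for i in range(n)]
            fj = running_mean(columns[j], resid, k)
            centre = sum(fj) / n
            partials[j] = [f - centre for f in fj]   # keep each f_j mean-zero
    return alpha, partials

# Hypothetical additive truth: sin(x1) + x2^2, plus Gaussian noise.
rng = random.Random(7)
n = 400
x1 = [rng.uniform(-2, 2) for _ in range(n)]
x2 = [rng.uniform(-2, 2) for _ in range(n)]
truth = [math.sin(a) + b ** 2 for a, b in zip(x1, x2)]
y = [t + rng.gauss(0, 0.3) for t in truth]

alpha, (f1, f2) = backfit([x1, x2], y)
fitted = [alpha + u + v for u, v in zip(f1, f2)]
rmse_vs_truth = math.sqrt(sum((f - t) ** 2 for f, t in zip(fitted, truth)) / n)
print(rmse_vs_truth)
```

Each pass only ever does one-dimensional smoothing, which is how the additive model sidesteps the curse of dimensionality.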

Reading: Notes, chapter 8; Faraway, chapter 12

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 09, 2012 10:30 | permanent link

February 07, 2012

It's Not the Heat that Gets to You, It's the Sustained Conjunction of Heat with Elevated Levels of Atmospheric Pollutants (Advanced Data Analysis from an Elementary Point of View)

In which spline regression becomes a matter of life and death in Chicago.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 07, 2012 10:31 | permanent link

Splines (Advanced Data Analysis from an Elementary Point of View)

Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression.
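Written out, the penalized least squares problem is the following. The notation is an assumption for this summary and may differ from the notes: m is the candidate function, λ ≥ 0 the strength of the penalty.

```latex
\[
\hat{m}_{\lambda} = \operatorname*{argmin}_{m} \;
  \frac{1}{n}\sum_{i=1}^{n} \left( y_i - m(x_i) \right)^2
  + \lambda \int \left( m''(x) \right)^2 \, dx
\]
```

As λ grows, any curvature becomes prohibitively expensive and the solution approaches the least-squares line. As λ shrinks to 0, the solution approaches a curve that interpolates the data.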

Reading: Notes, chapter 7; Faraway, section 11.2.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 07, 2012 10:30 | permanent link

February 02, 2012

Heteroskedasticity, Weighted Least Squares, and Variance Estimation (Advanced Data Analysis from an Elementary Point of View)

Weighted least squares estimates. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
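A minimal sketch of weighted least squares for a single predictor, in standard-library Python. The data are simulated with a noise standard deviation that grows with |x|, and the weights assume that this variance function is known, which is the easy case described above. The numbers are illustrative assumptions.

```python
import random

def wls_line(xs, ys, ws):
    """Weighted least squares for y = a + b*x: minimise sum of w * (y - a - b*x)^2."""
    total = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / total
    ybar = sum(w * y for w, y in zip(ws, ys)) / total
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    b = sxy / sxx
    return ybar - b * xbar, b

rng = random.Random(3)
xs = [rng.uniform(-3, 3) for _ in range(500)]
sds = [0.2 + abs(x) for x in xs]                   # assumed-known noise s.d.
ys = [1.0 + 2.0 * x + rng.gauss(0, sd) for x, sd in zip(xs, sds)]

a_ols, b_ols = wls_line(xs, ys, [1.0] * len(xs))                 # equal weights = OLS
a_wls, b_wls = wls_line(xs, ys, [1.0 / sd ** 2 for sd in sds])   # inverse-variance weights

print(a_ols, b_ols)
print(a_wls, b_wls)
```

When the variance function is not known, the same routine can be reused with weights built from an estimate of it, for example a smooth of the squared residuals.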

Reading: Notes, chapter 6; Faraway, section 11.3.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 02, 2012 10:30 | permanent link

January 31, 2012

Books to Read While the Algae Grow in Your Fur, January 2012

Attention conservation notice: I have no taste.

Stephen Greenblatt, The Swerve: How the World Became Modern
A rather rambling and formless, if amiable and enthusiastic, popular history of Lucretius's De Rerum Natura and its rediscovery during the Renaissance. The grandiosity of the subtitle is not, thankfully, insisted upon in the text, which in fact says rather little about the quite interesting history of how Lucretius was taken up, and Epicurean ideas were elaborated on, in early modern Europe. Passages of novelistic you-are-there detail, which Greenblatt admits are totally made up, are mercifully brief and fairly clearly marked as such. (Such claims of influence as he does make strike me as very thinly supported, though not clearly wrong.) Enjoyable, if slight, if you are prepared to care very deeply about books, and to sympathize with philosophical materialism.
(I am not sure why Greenblatt writes that the only manuscripts we have from the ancient world are those from Herculaneum preserved by the eruption of Mt. Vesuvius. In Egypt and other desert countries, manuscripts have survived from Roman, Ptolemaic and even earlier times, some of them rather famous. But he is not a classicist, and one hopes he is a bit more careful about his own period.)
Margaret C. Jacob, Strangers Nowhere in the World: The Rise of Cosmopolitanism in Early Modern Europe
On the positive side, the subject is important, and there were lots of interesting anecdotes and suggestions. Against that, it is far too scatter-shot and lacks not only a single global argument, but even much cohesion within individual chapters. It is also far too limited in scope, to the Enlightenment and its immediate predecessors in the 17th century. But if one wanted to look even at what was distinctive about that sort of cosmopolitanism, it's very strange to not even try to compare it to Latinate humanism and earlier medieval traditions, or the way the travels of learned artists spread styles and ideas during the Renaissance and before. (Comparison with any other part of the world is of course too much to expect of a Europeanist, even one interested in cosmopolitanism.) Finally, Jacob makes causal claims — e.g., that alchemical ideas in early-modern natural philosophy were displaced by mechanical ones because the latter were less politically troubling to monarchies — with a sweep and assurance totally out of proportion to anything she presents by way of evidence or argument. Over-all of little value to me, but perhaps of more use to specialists in the period.
Amar Bhidé, A Call for Judgment: Sensible Finance for a Dynamic Economy
Full-length review: Hayek contra Chicago.
Rachel Loden, Dick of the Dead
Not as good as her superb Hotel Imperium, but still great:
The Idiad

Shall I write a poem about you
And your epic struggle against stupidity?
Feh. But if the brain is a city
I too have rooms in the swampy part, surrounded by crocodiles.
The monarch butterflies sail down from the Canadian Rockies
To overwinter in Pacific Grove, pair off and fly away;
They bruise me. I get crankier.
If you are coming down through the narrows of the Saugatuck
Please text me beforehand,
And I will come out to meet you
As far as Palookaville.

Gerda Claeskens and Nils Lid Hjort, Model Selection and Model Averaging
Full-length review: How Can You Choose Just One?.
Shorter me: the best available review of model selection from a statistical standpoint. Presumes a reader with some knowledge of asymptotic statistics.
Shirley Jackson, The Haunting of Hill House
Exactly as good, as monstrous, and as ambiguous, as I remember it (unlike The Sundial). One mark of its excellence is that its things that go bump in the night are perfectly convincing, and yet the real horrors are all those of the all-too-human mind. I am not sure what point there is to other haunted house stories, really.
ObLinkage: Kit Whitfield on the first paragraph of the novel. Whitfield is exactly right about the way "small, unnerving echoes whisper back and forth along her pages". (Take, please take, the ending, for example.)
Patrick O'Brian, The Letter of Marque; The Thirteen Gun Salute; The Nutmeg of Consolation; Clarissa Oakes / The Truelove
Books to Read While the Algae Grow in Your Fur; Writing for Antiquity; The Great Transformation; The Commonwealth of Letters; Scientifiction and Fantastica; Enigmas of Chance; The Dismal Science

Posted by crshalizi at January 31, 2012 23:59 | permanent link

How the Hyracotherium Got Its Mass (Advanced Data Analysis from an Elementary Point of View)

In which we consider evolutionary trends in body size, aided by regression modeling and the bootstrap.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 31, 2012 19:11 | permanent link

The Bootstrap (Advanced Data Analysis from an Elementary Point of View)

Quantifying uncertainty by looking at sampling distributions. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping. Non-parametric bootstrapping. Many examples. When does the bootstrap fail?
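As a sketch of the non-parametric bootstrap (standard-library Python, simulated Gaussian data, all names hypothetical): resample the data with replacement, recompute the statistic each time, and use the spread of the replicates as the standard error. For the mean there is a textbook formula to check against. For the median there is no equally simple one, which is where the bootstrap earns its keep.

```python
import math
import random
import statistics

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Standard error of `statistic`, from resampling `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.pstdev(replicates)

rng = random.Random(11)
sample = [rng.gauss(10, 2) for _ in range(200)]   # illustrative data

se_mean_boot = bootstrap_se(sample, mean)
se_mean_formula = statistics.stdev(sample) / math.sqrt(len(sample))
se_median_boot = bootstrap_se(sample, statistics.median)

print(se_mean_boot, se_mean_formula)
print(se_median_boot)
```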

Reading: Notes, chapter 5 (R for figures and examples; pareto.R; wealth.dat); R for in-class examples

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 31, 2012 19:10 | permanent link

You think you want big data? You can't handle big data! (Next Week at the Statistics Seminar)

Fortunately, however, the methods of those who can handle big data are neither grotesque nor incomprehensible, and we will hear about them on Monday.

Alekh Agarwal, "Computation Meets Statistics: Trade-offs and Fundamental Limits for Large Data Sets"
Abstract: The past decade has seen the emergence of datasets of unprecedented scale, with both large sample sizes and dimensionality. Massive data sets arise in various domains, among them computer vision, natural language processing, computational biology, social networks analysis and recommendation systems, to name a few. In many such problems, the bottleneck is not just the number of data samples, but also the computational resources available to process the data. Thus, a fundamental goal in these problems is to characterize how estimation error behaves as a function of the sample size, number of parameters, and the computational budget available.
In this talk, I present three research threads that provide complementary lines of attack on this broader research agenda: (i) lower bounds for statistical estimation with computational constraints; (ii) interplay between statistical and computational complexities in structured high-dimensional estimation; and (iii) a computational budgeted framework for model selection. The first characterizes fundamental limits in a uniform sense over all methods, whereas the latter two provide explicit algorithms that exploit the interaction of computational and statistical considerations.
Joint work with John Duchi, Sahand Negahban, Clement Levrard, Pradeep Ravikumar, Peter Bartlett, and Martin Wainwright.
Time and place: 4--5 pm on Monday, 6 February 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Posted by crshalizi at January 31, 2012 19:00 | permanent link

"The Cut and Paste Process" (This Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about combinatorial stochastic processes and their statistical applications, and (2) will be in Pittsburgh on Wednesday afternoon.

It is only in very special weeks, when we have been very good, that we get two seminars.

Harry Crane, "The Cut-and-Paste Process"
Abstract: In this talk, we present the cut-and-paste process, a novel infinitely exchangeable process on the state space of partitions of the natural numbers whose sample paths differ from previously studied exchangeable coalescent (Kingman 1982; Pitman 1999) and fragmentation (Bertoin 2001) processes. Though it evolves differently, the cut-and-paste process possesses some of the same properties as its predecessors, including a unique equilibrium measure, associated measure-valued process, a Poisson point process construction and transition probabilities which can be described in terms of Kingman's paintbox process. A parametric subfamily is related to the Chinese restaurant process and we illustrate potential applications of this model to phylogenetic inference based on RNA/DNA sequence data. There are some natural extensions of this model to Bayesian inference, hidden Markov models and tree-valued Markov processes which we will discuss.
We also discuss how this process and its extensions fit into the more general framework of statistical modeling of structure and dependence via combinatorial stochastic processes, e.g. random partitions, trees and networks, and the practical importance of infinite exchangeability in this context.
Time and place: 4--5 pm on Wednesday, 1 February 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at January 31, 2012 18:45 | permanent link

January 28, 2012

Scientific Community to Elsevier: Drop Dead

Attention conservation notice: Associate editor at a non-profit scientific journal endorses a call for boycotting a for-profit scientific journal publisher.

I have for years been refusing to publish in or referee for journals published by Elsevier; pretty much all of the commercial journal publishers are bad deals[1], but they are outrageously worse than most. Since learning that Elsevier had a business line in putting out publications designed to look like peer-reviewed journals, and calling themselves journals, but actually full of paid-for BS, I have had a form letter I use for declining requests to referee, letting editors know about this, and inviting them to switch to a publisher which doesn't deliberately seek to profit by corrupting the process of scientific communication.

I am thus extremely happy to learn from Michael Nielsen that Tim Gowers is organizing a general boycott of Elsevier, asking people to pledge not to contribute to its journals, referee for them, or do editorial work for them. You can sign up here, and I strongly encourage you to do so. There are fields where Elsevier does publish the leading journals, and where this sort of boycott would be rather more personally costly than it is in statistics, but there is precedent for fixing that. Once again, I strongly encourage readers in academia to join this.

(To head off the inevitable mis-understandings, I am not, today, calling for getting rid of journals as we know them. I am saying that Elsevier is ripping us off outrageously, that conventional journals can be published without ripping us off, and so we should not help Elsevier to rip us off.)

Disclaimer, added 29 January: As I should have thought went without saying, I am speaking purely for myself here, and not with any kind of institutional voice. In particular, I am not speaking for the Annals of Applied Statistics, or for the IMS, which publishes it. (Though if the IMS asked its members to join in boycotting Elsevier, I would be very happy.)

1: Let's review how scientific journals work, shall we? Scientists are not paid by journals to write papers: we do that as volunteer work, or more exactly, part of the money we get for teaching and from research grants is supposed to pay for us to write papers. (We all have day-jobs.) Journals are edited by scientists, who volunteer for this and get nothing from the publisher. (New editors get recruited by old editors.) Editors ask other scientists to referee the submissions; the referees are volunteers, and get nothing from the publisher (or editor). Accepted papers are typeset by the authors, who usually have to provide "camera-ready" copy. The journal publisher typically provides an electronic system for keeping track of submitted manuscripts and the refereeing process. Some of them also provide a minimal amount of copy-editing on accepted papers, of dubious value. Finally, the publisher actually prints the journal, and runs the server distributing the electronic version of the paper, which is how, in this day and age, most scientists read it. While the publisher's contribution isn't nothing, it's also completely out of proportion to the fees they charge, let alone economically efficient pricing. The whole thing would grind to a halt without the work done by scientists, as authors, editors and referees. That work, to repeat, is paid for either by our students or by our grants, not by the publisher. This makes the whole system of for-profit journal publication economically insane, a check on the dissemination of knowledge which does nothing to encourage its creation. Elsevier is simply one of the worst of these parasites.

Manual trackback: Cosmic Variance; Open A Vein; AgroEcoPeople; QED Insight

Learned Folly

Posted by crshalizi at January 28, 2012 11:15 | permanent link

January 27, 2012

Changing How Changes Change (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about covariance matrices and (2) will be in Pittsburgh on Monday.

Since so much of multivariate statistics depends on patterns of correlation among variables, it is a bit awkward to have to admit that in lots of practical contexts, correlation matrices are just not very stable, and can change quite drastically. (Some people pay a lot to rediscover this.) It turns out that there are more constructive responses to this situation than throwing up one's hands and saying "that sucks", and on Monday a friend of the department and general brilliant-type-person will be kind enough to tell us about them:

Emily Fox, "Bayesian Covariance Regression and Autoregression"
Abstract: Many inferential tasks, such as analyzing the functional connectivity of the brain via coactivation patterns or capturing the changing correlations amongst a set of assets for portfolio optimization, rely on modeling a covariance matrix whose elements evolve as a function of time. A number of multivariate heteroscedastic time series models have been proposed within the econometrics literature, but are typically limited by lack of clear margins, computational intractability, and curse of dimensionality. In this talk, we first introduce and explore a new class of time series models for covariance matrices based on a constructive definition exploiting inverse Wishart distribution theory. The construction yields a stationary, first-order autoregressive (AR) process on the cone of positive semi-definite matrices.
We then turn our focus to more general predictor spaces and scaling to high-dimensional datasets. Here, the predictor space could represent not only time, but also space or other factors. Our proposed Bayesian nonparametric covariance regression framework harnesses a latent factor model representation. In particular, the predictor-dependent factor loadings are characterized as a sparse combination of a collection of unknown dictionary functions (e.g., Gaussian process random functions). The induced predictor-dependent covariance is then a regularized quadratic function of these dictionary elements. Our proposed framework leads to a highly-flexible, but computationally tractable formulation with simple conjugate posterior updates that can readily handle missing data. Theoretical properties are discussed and the methods are illustrated through an application to the Google Flu Trends data and the task of word classification based on single-trial MEG data.
Time and place: 4--5 pm on Monday, 30 January 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at January 27, 2012 14:25 | permanent link

January 26, 2012

Smoothing Methods in Regression (Advanced Data Analysis from an Elementary Point of View)

The constructive alternative to complaining about linear regression is non-parametric regression. There are many ways to do this, but we will focus on the conceptually simplest one, which is smoothing; especially kernel smoothing. All smoothers involve local averaging of the training data. The bias-variance trade-off tells us that there is an optimal amount of smoothing, which depends both on how rough the true regression curve is, and on how much data we have; we should smooth less as we get more information about the true curve. Knowing the truly optimal amount of smoothing is impossible, but we can use cross-validation to select a good degree of smoothing, and adapt to the unknown roughness of the true curve. Detailed examples. Analysis of how quickly kernel regression converges on the truth. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.
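For concreteness, here is kernel smoothing as local averaging, sketched in standard-library Python. This is the Nadaraya-Watson form with a Gaussian kernel, which is one common choice and not necessarily the one used in the notes. The sine-curve data are simulated for illustration.

```python
import math
import random

def kernel_smooth(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0: a weighted average of the ys,
    with Gaussian weights that fall off with distance from x0 at scale h."""
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Hypothetical data: a noisy sine curve on an evenly spaced grid.
rng = random.Random(5)
n = 300
xs = [2 * math.pi * i / (n - 1) for i in range(n)]
ys = [math.sin(x) + rng.gauss(0, 0.2) for x in xs]

# The true curve is 1 at pi/2. A small bandwidth follows the noise,
# a large one flattens the estimate towards the overall mean of y.
for h in (0.05, 0.3, 3.0):
    print(h, kernel_smooth(math.pi / 2, xs, ys, h))
```

The bandwidth h is the knob the bias-variance trade-off turns, and it is the thing one would pick by cross-validation.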

Readings: Notes, chapter 4 (R); Faraway, section 11.1

Optional readings: Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2012 10:30 | permanent link

Advantages of Backwardness (Advanced Data Analysis from an Elementary Point of View)

In which we try to discern whether poor countries grow faster.

Assignment, R, penn-select.csv data set

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2012 09:30 | permanent link

January 24, 2012

Model Evaluation: Error and Inference (Advanced Data Analysis from an Elementary Point of View)

Goals of statistical analysis: summaries, prediction, scientific inference. Evaluating predictions: in-sample error, generalization error; over-fitting. Cross-validation for estimating generalization error and for model selection. Justifying model-based inferences; Luther and Süleyman.
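A minimal sketch of k-fold cross-validation for model selection, in standard-library Python. The two candidate models (a constant and a straight line) and the simulated data are assumptions chosen to keep the example short. Any function that takes training data and returns a predictor could be passed in as `fit`.

```python
import random

def fit_constant(xs, ys):
    """Model 1: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Model 2: simple linear regression by least squares."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return lambda x: a + b * x

def cv_mse(xs, ys, fit, k=5, seed=0):
    """k-fold cross-validated mean squared error of the model returned by `fit`."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k roughly equal folds
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Score only on points the model never saw during fitting.
        sq_err += sum((ys[i] - model(xs[i])) ** 2 for i in fold)
    return sq_err / len(xs)

rng = random.Random(2)
xs = [rng.uniform(0, 3) for _ in range(200)]
ys = [1.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]   # noise variance 1

cv_const = cv_mse(xs, ys, fit_constant)
cv_line = cv_mse(xs, ys, fit_line)
print(cv_const, cv_line)   # pick the model with the smaller estimate
```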

Reading: Notes, chapter 3 (R for examples and figures).

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 24, 2012 10:30 | permanent link

The Truth About Linear Regression (Advanced Data Analysis from an Elementary Point of View)

Multiple linear regression: general formula for the optimal linear predictor. Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why R^2 is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable problems). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What "controlled for in a linear regression" really means.
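For reference, the general formula in question is the standard one. The notation here is an assumption and may differ from the notes. If Y is predicted by α + β·X, the mean squared error is minimized by

```latex
\[
\beta = \mathrm{Var}[\vec{X}]^{-1} \, \mathrm{Cov}[\vec{X}, Y], \qquad
\alpha = \mathbb{E}[Y] - \beta \cdot \mathbb{E}[\vec{X}]
\]
```

Both pieces are moments of the joint distribution of the inputs and the response. When the true regression function is not linear, changing the distribution of the inputs changes these moments, and so changes the optimal coefficients.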

Reading: Notes, chapter 2 (R for examples and figures); Faraway, chapter 1 (continued).

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 24, 2012 10:15 | permanent link

January 22, 2012

Dungeons and Debtors

Attention conservation notice: A silly idea about gamifying credit cards, which would be evil if it worked.

To make a profit in an otherwise competitive industry, it helps if you can impose switching costs on your customers, making them either pay to stop doing business with you, or give up something of value to them. There are whole books about this, written by respected economists[1].

This is why credit card companies are happy to offer rewards for use: accumulating points on a card, which would not move with you if you got a new card and transferred the balance, is an attempt to create switching costs. Unfortunately, from the point of view of the banks, people will redeem their points from time to time, so some money must be spent on the rewards. The ideal would be points which people would value but which would never cost the bank anything.

Item: Computer games are, deliberately, addictive. Social games are especially addictive.

Accordingly, if I were an evil and unscrupulous credit card company (but I repeat myself), I would create an online game, where people could get points either from playing the game, or from spending money with my credit card. For legal reasons, I think it would probably be best to allow the game to technically be open to everyone, but with a registration fee which is, naturally, waived for card-holders. Of course, the game software would be set up to announce on Facebook (etc.) whenever the player/debtor leveled up. I would also be tempted to award double points for fees, and triple for interest charges, but one could experiment with this. If they close their credit card account, they have to start the game over from the beginning.

The fact that online acquaintances can't tell whether the debtor is advancing through spending or through game-play helps keep the reward points worth having. It's true that the credit card company has to pay for the game's design (a one-time start-up cost) and the game servers, but these are fairly cheap, and the bank never has to cash out points in actual dollars or goods. The debtors themselves do all the work of investing the points with meaning and value. They impose the switching costs on themselves.

My plan is sheer elegance in its simplicity, and I will be speaking to an attorney about a business method patent first thing Monday.

1: Much can be learned about our benevolent new-media overlords from the fact that this book carries a blurb from Jeff Bezos of Amazon, and that Varian now works for Google.

Modest Proposals;

Posted by crshalizi at January 22, 2012 10:15 | permanent link

January 17, 2012

"Can't seem to face up to the facts"

Attention conservation notice: An academic paper you've never heard of, about a distressing subject, had bad statistics and is generally foolish.

Because my so-called friends like to torment me, several of them made sure that I knew a remarkably idiotic paper about power laws was making the rounds, promoted by the ignorant and credulous, with assistance from the credulous and ignorant, supported by capitalist tools:

M. V. Simkin and V. P. Roychowdhury, "Stochastic modeling of a serial killer", arxiv:1201.2458
Abstract: We analyze the time pattern of the activity of a serial killer, who during twelve years had murdered 53 people. The plot of the cumulative number of murders as a function of time is of "Devil's staircase" type. The distribution of the intervals between murders (step length) follows a power law with the exponent of 1.4. We propose a model according to which the serial killer commits murders when neuronal excitation in his brain exceeds certain threshold. We model this neural activity as a branching process, which in turn is approximated by a random walk. As the distribution of the random walk return times is a power law with the exponent 1.5, the distribution of the inter-murder intervals is thus explained. We confirm analytical results by numerical simulation.

Let's see if we can't stop this before it gets too far, shall we? The serial killer in question is one Andrei Chikatilo, and that Wikipedia article gives the dates of death of his victims, which seems to have been Simkin and Roychowdhury's data source as well. Several of these are known only imprecisely, so I made guesses within the known ranges; the results don't seem to be very sensitive to the guesses. Simkin and Roychowdhury plotted the distribution of days between killings in a binned histogram on a logarithmic scale; as we've explained elsewhere, this is a bad idea, which destroys information to no good purpose, and a better display shows the (upper or complementary) cumulative distribution function[1], which looks like so:

When I fit a power law to this by maximum likelihood, I get an exponent of 1.4, like Simkin and Roychowdhury; that looks like this:

Update: The 95% (bootstrap) confidence interval for the exponent is (1.35,1.48), which you will notice excludes 1.5.

On the other hand, when I fit a log-normal (because Gauss is not mocked), we get this:

After that figure, a formal statistical test is almost superfluous, but let's do it anyway, because why just trust our eyes when we can calculate? The data are better fit by the log-normal than by the power-law (the data are e^10.41, or about 33 thousand, times more likely under the former than the latter), but that could happen via mere chance fluctuations, even when the power law is right. Vuong's model comparison test lets us quantify that probability, and tells us a power-law would produce data which seems to fit a log-normal this well no more than 0.4 percent[2] of the time. Not only does the log-normal distribution fit better than the power-law, the difference is so big that it would be absurd to try to explain it away as bad luck. In absolute terms, we can find the probability of getting as big a deviation between the fitted power law and the observed distribution through sampling fluctuations, and it's about 0.03 percent[2b]. [R code for figures, estimates and test, including data.]
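
For readers who want to see the arithmetic behind a Vuong-style comparison, here is a minimal Python sketch. It is not the R code linked above, and the function name and argument layout are my own illustrative choices. It assumes you already have the log-likelihood of each observation under each of the two fitted models, and it uses the usual normal approximation for the standardized mean difference.

```python
import math
import statistics

def vuong_test(loglik_a, loglik_b):
    """Compare two non-nested models fit to the same observations.

    loglik_a[i] and loglik_b[i] are the log-likelihoods of observation i
    under models A and B. Returns (z, p_lower): z is the standardized mean
    log-likelihood difference (A minus B), and p_lower is the standard
    normal probability of a value at or below z.
    """
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.pstdev(diffs)  # assumption: population-style sd
    z = math.sqrt(n) * mean_diff / sd_diff
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z
    return z, p_lower
```

With model A as the power law and model B as the log-normal, a strongly negative z with a tiny p_lower says the power law's deficit is too large to blame on chance. As a rough check on note 2: a mean difference of 0.20 and a standard deviation of 0.54 over 52 intervals (my assumption about the sample size, from 53 victims) gives a z of roughly -2.7.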

Since Simkin and Roychowdhury's model produces a power law, and these data, whatever else one might say about them, are not power-law distributed, I will refrain from discussing all the ways in which it is a bad model. I will re-iterate that it is an idiotic paper, which is different from saying that Simkin and Roychowdhury are idiots; they are not, and have done interesting work on, e.g., estimating how often references are copied from bibliographies without being read by tracking citation errors[4]. But the idiocy in this paper goes beyond statistical incompetence. The model used here was originally proposed for the time intervals between epileptic fits. The authors realize that

[i]t may seem unreasonable to use the same model to describe an epileptic and a serial killer. However, Lombroso [5] long ago pointed out a link between epilepsy and criminality.
That would be the 19th-century pseudo-scientist[3] Cesare Lombroso, who also thought he could identify criminals from the shape of their skulls; for "pointed out", read "made up". Like I said: idiocy.

As for the general issues about power laws and their abuse, say something once, why say it again?

Update 9 pm that day: Added the goodness-of-fit test (text before note 2b, plus that note), updated code, added PNG versions of figures, added attention conservation notice.
21 January: typo fixes (missing pronoun, mis-placed decimal point), added bootstrap confidence interval for exponent, updated code accordingly.

Manual trackback: Hacker News (do I really need to link to this?), Naked Capitalism (?!); Mathbabe; Wolfgang Beirl; Ars Mathematica (yes, I am that predictable); Improbable Research (I am not worthy)

1: This is often called the "survival function", but that seems inappropriate here.

2: On average, the log-likelihood of each observation was 0.20 higher under the log-normal than under the power law, and the standard deviation of the log likelihood ratio over the samples was only 0.54. The test statistic thus comes out to -2.68, and the one-sided p-value to 0.36%.

2b: Use a Kolmogorov-Smirnov test. Since the power law has a parameter estimated from data (namely, the exponent), we can't just plug in to the usual tables for a K-S test, but we can find a p-value by simulating the power law (as in my paper with Aaron and Mark), and when I do that, with a hundred thousand replications, the p-value is about 3*10^-4.

3: There are in fact subtle, not to say profound, issues in the sociology and philosophy of science here: was Lombroso always a pseudo-scientist, because his investigations never came up to any acceptable standard of reliable inquiry? Or just because they didn't come up to the standards of inquiry prevalent at the time he wrote? Or did Lombroso become a pseudo-scientist, when enough members of enough intellectual communities woke up from the pleasure of having their prejudices about the lower orders echoed to realize that he was full of it? However that may be, this paper has the dubious privilege of being the first time I have ever seen Lombroso cited as an authority rather than a specimen.

4: Actually, for several years my bibliography data base had the wrong page numbers for one of my own papers, due to a typo, so their method would flag some of my subsequent works as written by someone who had cited that paper without reading it, which I assure you was not the case. But the idea seems reasonable in general.
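
The simulation-based goodness-of-fit test described in note 2b can be sketched in Python as follows. This is not the linked R code: it assumes a continuous power law with a known lower cutoff `xmin`, whereas the actual data are whole days, and all the function names are invented for illustration.

```python
import math
import random

def fit_alpha(xs, xmin):
    """Maximum-likelihood exponent for p(x) proportional to x^-alpha, x >= xmin."""
    return 1.0 + len(xs) / sum(math.log(x / xmin) for x in xs)

def ks_distance(xs, alpha, xmin):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    xs = sorted(xs)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        gap = max(gap, abs(cdf - i / n), abs(cdf - (i + 1) / n))
    return gap

def sample_power_law(n, alpha, xmin, rng):
    """Inverse-transform draws from the continuous power law."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def bootstrap_ks_pvalue(xs, xmin, replications=1000, seed=0):
    """Fraction of simulated power-law samples that fit worse than the data."""
    rng = random.Random(seed)
    alpha_hat = fit_alpha(xs, xmin)
    observed = ks_distance(xs, alpha_hat, xmin)
    worse = 0
    for _ in range(replications):
        sim = sample_power_law(len(xs), alpha_hat, xmin, rng)
        # Re-estimate the exponent on every simulated sample; this is what
        # accounts for the exponent having been fitted to the data.
        if ks_distance(sim, fit_alpha(sim, xmin), xmin) >= observed:
            worse += 1
    return worse / replications
```

The step that matters is re-fitting inside the loop: comparing the observed distance to simulations that use the true exponent, rather than a re-estimated one, would give the wrong reference distribution, which is why the usual K-S tables do not apply.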

Power Laws; Learned Folly

Posted by crshalizi at January 17, 2012 20:23 | permanent link

What's That Got to Do with the Price of Condos in California? (Advanced Data Analysis from an Elementary Point of View)

In which we practice the art of linear regression upon the California real-estate market, by way of warming up for harder tasks.

Assignment, data set

(Yes, the data set is now about as old as my students, but last week in Austin I was too busy drinking on 6th street having lofty conversations about the future of statistics to update the file with the UScensus2000 package.)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 17, 2012 10:31 | permanent link

Regression: Predicting and Relating Quantitative Features (Advanced Data Analysis from an Elementary Point of View)

Statistics is the science which studies methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
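
The two linear smoothers named at the end of that summary are easy to sketch. What follows is a minimal Python illustration, not code from the course notes; the Gaussian kernel, the function names and the parameter names are my own assumptions.

```python
import math

def knn_smooth(x0, xs, ys, k):
    """Nearest-neighbor averaging: mean response of the k points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def kernel_smooth(x0, xs, ys, bandwidth):
    """Kernel-weighted averaging with a Gaussian kernel (one possible choice)."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [0.1, 0.9, 2.2, 2.8, 4.1]
    print(knn_smooth(2.0, xs, ys, k=3))
    print(kernel_smooth(2.0, xs, ys, bandwidth=1.0))
```

Both predictions are weighted averages of the observed responses, with weights that depend only on the inputs, which is what makes them linear smoothers. The number of neighbors `k` and the `bandwidth` play the same role: they set how local the averaging is, and so where the estimate sits on the bias-variance trade-off.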

Readings: Notes, chapter 1; Faraway, chapter 1, through page 17.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 17, 2012 10:30 | permanent link

January 07, 2012

Mail Woes

If you sent me e-mail at my @stat.cmu.edu address in the last few days, I haven't gotten it, and may never get it. The address firstinitiallastname at cmu dot edu now points somewhere where I can read.

Posted by crshalizi at January 07, 2012 20:40 | permanent link

January 06, 2012

Sloth in Austin

I'll be speaking at UT-Austin next week, through the kindness of the division of statistics and scientific computation:

"When Can We Learn Network Models from Samples?"
Abstract: Statistical models of network structure are models for the entire network, but the data are typically just a sampled sub-network. Parameters for the whole network, which are what we care about, are estimated by fitting the model on the sub-network. This assumes that the model is "consistent under sampling" (forms a projective family). For the widely-used exponential random graph models (ERGMs), this trivial-looking condition is violated by many popular and scientifically appealing models; satisfying it drastically limits ERGMs' expressive power. These results are special cases of more general ones about exponential families of dependent variables, which we also prove. As a consolation prize, we offer easily checked conditions for the consistency of maximum likelihood estimation in ERGMs, and discuss some possible constructive responses.
Time and place: 2--3 pm on Wednesday, 11 January 2012, in Hogg Building (WCH), room 1.108

This will of course be based on my paper with Alessandro, but since I understand some non-statisticians may sneak in, I'll try to be more comprehensible and less technical.

Since this will be my first time in Austin (indeed my first time in Texas), and I have (for a wonder) absolutely no obligations on the 12th, suggestions on what I should see or do would be appreciated.

Self-Centered

Posted by crshalizi at January 06, 2012 14:15 | permanent link

January 03, 2012

Course Announcement: Advanced Data Analysis from an Elementary Point of View

It's that time again:

36-402, Advanced Data Analysis, Spring 2012
Description: This course introduces modern methods of data analysis, building on the theory and application of linear models from 36-401. Topics include nonlinear regression, nonparametric smoothing, density estimation, generalized linear and generalized additive models, simulation and predictive model-checking, cross-validation, bootstrap uncertainty estimation, multivariate methods including factor analysis and mixture models, and graphical models and causal inference. Students will analyze real-world data from a range of fields, coding small programs and writing reports.
Prerequisites: 36-401 (modern regression); or consent of instructor, in extraordinary cases
Time and place: 10:30--11:50 am, Tuesdays and Thursdays, in Porter Hall 100
Note: Graduate students in other departments wishing to take this course for credit need consent of the instructor, and should register for 36-608.

Fuller details on the class homepage, including a detailed (but subject to change) list of topics, and links to the compiled course notes. I'll post updates here to the notes for specific lectures and assignments, like last time.

This is the same course I taught last spring, only grown from sixty-odd students to (currently) ninety-three (from 12 different majors!). The smart thing for me to do would probably be to change nothing (I haven't gotten to re-teach a class since 2009), but I felt the urge to re-organize the material and squeeze in a few more topics.

The biggest change I am making is introducing some quality-control sampling. The course is too big for me to look over much of the students' work, and even then, that gives me little sense of whether the assignments are really probing what they know (much less helping them learn). So I will be randomly selecting six students every week, to come to my office and spend 10--15 minutes each explaining the assignment to me and answering live questions about it. Even allowing for students being randomly selected multiple times*, I hope this will give me a reasonable cross-section of how well the assignments are working, and how well the grading tracks that. But it's an experiment and we'll see how it goes.

* (exercise for the student): Find the probability distribution of the number of times any given student gets selected. Assume 93 students, with 6 students selected per week, and 14 weeks. (Also assume no one drops the class.) Find the distribution of the total number of distinct students who ever get selected.
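
One way to check an answer to the exercise numerically is sketched below in Python. It assumes that each week's six students are drawn without replacement from all 93, independently of earlier weeks; the function names are made up. Under those assumptions the number of times a given student is picked is binomial, and the count of distinct students can be tracked week by week with hypergeometric steps.

```python
from math import comb

N_STUDENTS, PER_WEEK, WEEKS = 93, 6, 14

def times_selected_pmf(n=N_STUDENTS, k=PER_WEEK, weeks=WEEKS):
    """P(a given student is selected t times), t = 0..weeks: Binomial(weeks, k/n)."""
    p = k / n
    return [comb(weeks, t) * p ** t * (1 - p) ** (weeks - t) for t in range(weeks + 1)]

def distinct_selected_pmf(n=N_STUDENTS, k=PER_WEEK, weeks=WEEKS):
    """P(exactly d distinct students are ever selected), d = 0..n."""
    dist = [0.0] * (n + 1)
    dist[0] = 1.0  # before week 1, nobody has been selected
    total = comb(n, k)
    for _ in range(weeks):
        updated = [0.0] * (n + 1)
        for seen, prob in enumerate(dist):
            if prob == 0.0:
                continue
            for new in range(k + 1):
                # `new` first-timers from the unseen, the rest from those already seen
                ways = comb(n - seen, new) * comb(seen, k - new)
                if ways:
                    updated[seen + new] += prob * ways / total
        dist = updated
    return dist

if __name__ == "__main__":
    print("P(never selected):", times_selected_pmf()[0])
    distinct = distinct_selected_pmf()
    print("Expected distinct students:", sum(d * p for d, p in enumerate(distinct)))
```

As a sanity check, the mean of the second distribution should equal 93 times one minus the probability of never being selected, by linearity of expectation.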

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 03, 2012 23:00 | permanent link

January 01, 2012

End of Year Inventory, 2011

Attention conservation notice: Navel-gazing.

Paper manuscripts completed: 12
Papers accepted: 2 [i, ii], one from last year
Papers rejected: 10 (fools! I'll show you all!)
Papers rejected with a comment from the editor that no one should take the paper I was responding to, published in the same glossy high-impact journal, "literally": 1
Papers in refereeing limbo: 4
Papers in progress: I won't look in that directory and you can't make me

Grant proposals submitted: 3
Grant proposals rejected: 4 (two from last year)
Grant proposals in refereeing limbo: 1
Grant proposals in progress for next year: 3

Talks given and conferences attended: 20, in 14 cities

Manuscripts refereed: 46, for 18 different journals and conferences
Manuscripts waiting for me to referee: 7
Manuscripts for which I was the responsible associate editor at Annals of Applied Statistics: 10
Book proposals reviewed: 3

Classes taught: 2
New classes taught: 2
Summer school classes taught: 1
New summer school classes taught: 1
Pages of new course material written: about 350

Students who are now ABD: 1
Students who are not just ABD but on the job market: 1

Letters of recommendation written: 8 (with about 100 separate destinations)

Promotion packets submitted: 1 (for promotion to associate professor, but without tenure)
Promotion cases still working through the system: 1

Book reviews published on dead trees: 2 [i, ii]
Non-book-reviews published on dead trees: 1

Weblog posts: 157
Substantive weblog posts: 54, counting algal growths

Books acquired: 298
E-book readers gratefully received: 1
Books driven by my mother from her house to Pittsburgh: about 800
Books begun: 254
Books finished: 204 (of which 34 on said e-book reader)
Books given up: 16
Books sold: 133
Books donated: 113

Book manuscripts completed: 0

Wisdom teeth removed: 4
Unwise teeth removed: 1

Major life transitions: 0

Self-Centered

Posted by crshalizi at January 01, 2012 12:00 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems