Attention conservation notice: Unskillful nattering about pop-culture ephemera.
For the sake of my own sanity, I prefer to remain ignorant of the occult processes by which the direct mail gods decide which catalogues to send to which people. (There's too much dynamic programming involved.) Today, for instance, they decided to inflict upon me the official Barbie doll spring 2010 collection preview, and like a fool I couldn't resist looking through it. Thus my life is made that much worse by learning that there is a Joan Jett Barbie doll. (I thought about embedding an image, but in this case pain shared is not pain eased.) I think I finally grasp what people mean when they talk about later cultural products assaulting parts of their childhood, in this case one I didn't even realize I valued.
Posted by crshalizi at November 24, 2009 19:27 | permanent link
A number of people have asked for my slides from the MERSIH conference the other week. So, here they are. (Anyone who was at my talk at SFI about a year ago will recognize the title, and much of the content.) I'm presently turning this into a proper manuscript, so comments are welcome. Please don't rip it off; I'll become very cross and may even hold my breath until I turn blue and pass out, and won't you be sorry then?
Posted by crshalizi at November 21, 2009 18:32 | permanent link
In which the starry heavens above submit to statistical analysis:
Enigmas of Chance; The Eternal Silence of These Infinite Spaces; Physics
Posted by crshalizi at November 19, 2009 12:02 | permanent link
Attention conservation notice: Of no use to you unless (1) you want to know what statisticians do at search-engine companies and (2) you are in Pittsburgh.
As always, the talk is free and open to the public.
Posted by crshalizi at November 13, 2009 15:09 | permanent link
Attention conservation notice: Quasi-teaching note giving an economic interpretation of the Neyman-Pearson lemma on statistical hypothesis testing.
Suppose we want to pick out some sort of signal from a background of noise. As every schoolchild knows, any procedure for doing this, or test, divides the data space into two parts, the one where it says "noise" and the one where it says "signal".* Tests will make two kinds of mistakes: they can take noise to be signal, a false alarm, or can ignore a genuine signal as noise, a miss. Both the signal and the noise are stochastic, or we can treat them as such anyway. (Any determinism distinguishable from chance is just insufficiently complicated.) We want tests where the probabilities of both types of errors are small. The probability of a false alarm is called the size of the test; it is the measure of the "say 'signal'" region under the noise distribution. The probability of a miss, as opposed to a false alarm, has no short name in the jargon, but one minus the probability of a miss (the probability of detecting a signal when it's present) is called power.
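To make size and power concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration only: the noise is taken to be a standard Gaussian, the signal a Gaussian with mean 1, and the test is the simple rule "say 'signal' if x exceeds a cutoff". The function name `size_and_power` is made up.

```python
from statistics import NormalDist

# Assumed distributions, chosen only for illustration.
noise = NormalDist(mu=0.0, sigma=1.0)
signal = NormalDist(mu=1.0, sigma=1.0)

def size_and_power(cutoff):
    """Size and power of the test that says 'signal' whenever x > cutoff."""
    # Size: probability of the "say 'signal'" region under the noise distribution.
    size = 1.0 - noise.cdf(cutoff)
    # Power: probability of the same region under the signal distribution.
    power = 1.0 - signal.cdf(cutoff)
    return size, power

if __name__ == "__main__":
    for cutoff in (0.0, 1.0, 2.0):
        size, power = size_and_power(cutoff)
        print(f"cutoff={cutoff:.1f}  size={size:.3f}  power={power:.3f}")
```

Raising the cutoff shrinks the "say 'signal'" region, so both the size and the power go down together; that trade-off is what the rest of the argument is about.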
Suppose we know that the probability density of the noise is p and that of the signal is q. The Neyman-Pearson lemma, as many though not all schoolchildren know, says that then, among all tests of a given size s, the one with the smallest miss probability, or highest power, has the form "say 'signal' if q(x)/p(x) > t(s), otherwise say 'noise'," and that the threshold t varies inversely with s. The quantity q(x)/p(x) is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if it's sufficiently more likely than noise.
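A small sketch of such a test, under assumptions that are not in the lemma itself: p is a standard Gaussian density and q is a Gaussian density with mean 1. In that special case the likelihood ratio is increasing in x, so the threshold t(s) can be read off from the noise quantile at 1 - s. The function names are hypothetical.

```python
from statistics import NormalDist

# Assumed densities, chosen only for illustration.
noise = NormalDist(mu=0.0, sigma=1.0)    # p
signal = NormalDist(mu=1.0, sigma=1.0)   # q

def likelihood_ratio(x):
    """q(x) / p(x)."""
    return signal.pdf(x) / noise.pdf(x)

def threshold_for_size(s):
    """Likelihood-ratio threshold t(s) for a test of size s.

    Relies on the ratio being increasing in x for this particular pair of
    Gaussians, so that {ratio > t} is the same region as {x > cutoff}.
    """
    cutoff = noise.inv_cdf(1.0 - s)
    return likelihood_ratio(cutoff)

def lr_test(x, s):
    """Say 'signal' if the likelihood ratio at x exceeds t(s)."""
    return "signal" if likelihood_ratio(x) > threshold_for_size(s) else "noise"

if __name__ == "__main__":
    for s in (0.01, 0.05, 0.5):
        print(f"size={s:.2f}  threshold={threshold_for_size(s):.3f}")
    print(lr_test(3.0, 0.05), lr_test(0.0, 0.05))
```

Running it shows the threshold falling as the allowed size rises, which is the inverse relation between t and s mentioned above.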
The likelihood ratio indicates how different the two distributions — the two hypotheses — are at x, the data-point we observed. It makes sense that the outcome of the hypothesis test should depend on this sort of discrepancy between the hypotheses. But why the ratio, rather than, say, the difference q(x) - p(x), or a signed squared difference, etc.? Can we make this intuitive?
Start with the fact that we have an optimization problem under a constraint. Call the region where we proclaim "signal" R. We want to maximize its probability when we are seeing a signal, Q(R), while constraining the false-alarm probability, P(R) = s. Lagrange tells us that the way to do this is to maximize Q(R) - t[P(R) - s] over R, with the multiplier t chosen so that the constraint holds. So far the usual story; the next turn is usually "as you remember from the calculus of variations..."
Rather than actually doing math, let's think like economists. Picking the set R gives us a certain benefit, in the form of the power Q(R), and a cost, tP(R). (The ts term is the same for all R.) Economists, of course, tell us to equate marginal costs and benefits. What is the marginal benefit of expanding R to include a small neighborhood around the point x? Just, by the definition of "probability density", q(x). The marginal cost is likewise tp(x). We should include x in R if q(x) > tp(x), or q(x)/p(x) > t. The boundary of R is where marginal benefit equals marginal cost, and that is why we need the likelihood ratio and not the likelihood difference, or anything else. (Except for a monotone transformation of the ratio, e.g. the log ratio.) The likelihood ratio threshold t is, in fact, the shadow price of statistical power.
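The marginal argument can be checked by brute force on a toy problem. The sketch below assumes a sample space of just six outcomes, with made-up probability mass functions `p` and `q` standing in for the densities; it compares the region picked point by point with the rule q(x) > t p(x) against an exhaustive search over all 64 candidate regions for the one maximizing Q(R) - tP(R). All names and numbers are hypothetical.

```python
from itertools import combinations

# Made-up probability mass functions on outcomes 0..5 (illustration only).
p = [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]   # noise
q = [0.05, 0.10, 0.15, 0.20, 0.25, 0.25]   # signal

def net_benefit(region, t):
    """Benefit Q(R) minus cost t * P(R) of saying 'signal' on `region`."""
    return sum(q[x] - t * p[x] for x in region)

def marginal_rule(t):
    """Include x exactly when marginal benefit q(x) exceeds marginal cost t * p(x)."""
    return {x for x in range(len(p)) if q[x] > t * p[x]}

def brute_force(t):
    """Search every subset of outcomes for the one with the largest net benefit."""
    outcomes = range(len(p))
    regions = (set(c) for k in range(len(p) + 1)
               for c in combinations(outcomes, k))
    return max(regions, key=lambda region: net_benefit(region, t))

if __name__ == "__main__":
    for t in (0.5, 1.0, 1.5):
        print(t, sorted(marginal_rule(t)), sorted(brute_force(t)))
```

For each price t tried here the two regions coincide, and raising t shrinks the region, trading away power for a smaller false-alarm probability.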
I am pretty sure I have not seen or heard the Neyman-Pearson lemma explained marginally before, but in retrospect it seems too simple to be new, so pointers would be appreciated.
Manual trackback: John Barrdear
Updates: Thanks to David Kane for spotting a typo.
*: Yes, you could have a randomized test procedure, but the situations where those actually help pretty much define "boring, merely-technical complications."
Posted by crshalizi at November 08, 2009 03:06 | permanent link
My lesson-plan having survived first contact with the enemy students, it's time to start posting the lecture handouts &c. This page will be updated as the semester goes on; the RSS feed for it should be here.
The class homepage has more information.
Homework 1, due 4 September: assignment, R, data; SOLUTIONS
Homework 2, due 11 September: assignment; SOLUTIONS TEXT; SOLUTIONS R
Homework 3, due 18 September: assignment
Homework 4, due 25 September: assignment; SOLUTIONS
Pre-midterm review (12 October): highlights of the course to date; no handout.
MIDTERM (14 October): exam, solutions
Homework 5, due 23 October: assignment; solutions
Homework 6, due Friday, 30 October: assignment, data set; solutions
Homework 7, due Friday, 6 November: assignment
Posted by crshalizi at November 05, 2009 22:45 | permanent link
My old Blosxom installation (v. 2.0.2), after several years of working nicely, is growing increasingly cranky, and mulishly refusing to generate or update posts as the whim takes it. (I am not sure how much kicking and shoving it will need to produce this.) I'd appreciate a pointer to something which works similarly, but does work: I write posts in plain HTML in Emacs and drop them in a directory; it makes them look nice. If it handles tags and/or LaTeX nicely, so much the better.
Posted by crshalizi at November 04, 2009 19:34 | permanent link