February 04, 2010

Upcoming Gigs: Bristol

I am giving two talks in Bristol next week about (not so coincidentally) my two latest papers.

"The Computational Structure of Spike Trains"
Bristol Centre for Complexity Sciences, SM2 in the School of Mathematics, 2 pm on Tuesday 9 February
Abstract: Neurons perform computations, and convey the results of those computations through the statistical structure of their output spike trains. Here we present a practical method, grounded in the information-theoretic analysis of prediction, for inferring a minimal representation of that structure and for characterizing its complexity. Starting from spike trains, our approach finds their causal state models (CSMs), the minimal hidden Markov models or stochastic automata capable of generating statistically identical time series. We then use these CSMs to objectively quantify both the generalizable structure and the idiosyncratic randomness of the spike train. Specifically, we show that the expected algorithmic information content (the information needed to describe the spike train exactly) can be split into three parts describing (1) the time-invariant structure (complexity) of the minimal spike-generating process, which describes the spike train statistically; (2) the randomness (internal entropy rate) of the minimal spike-generating process; and (3) a residual pure noise term not described by the minimal spike-generating process. We use CSMs to approximate each of these quantities. The CSMs are inferred nonparametrically from the data, making only mild regularity assumptions, via the causal state splitting reconstruction algorithm. The methods presented here complement more traditional spike train analyses by describing not only spiking probability and spike train entropy, but also the complexity of a spike train's structure. We demonstrate our approach using both simulated spike trains and experimental data recorded in rat barrel cortex during vibrissa stimulation.
Joint work with Rob Haslinger and Kristina Lisa Klinkner.
"Dynamics of Bayesian updating with dependent data and misspecified models"
Statistics seminar, Department of Mathematics, Seminar Room SM3, 2:15pm on Friday 12 February
Abstract: Much is now known about the consistency of Bayesian non-parametrics with independent or Markovian data. Necessary conditions for consistency include the prior putting enough weight on the right neighborhoods of the true distribution; various sufficient conditions further restrict the prior in ways analogous to capacity control in frequentist nonparametrics. The asymptotics of Bayesian updating with mis-specified models or priors, or non-Markovian data, are far less well explored. Here I establish sufficient conditions for posterior convergence when all hypotheses are wrong, and the data have complex dependencies. The main dynamical assumption is the asymptotic equipartition (Shannon-McMillan-Breiman) property of information theory. This, plus some basic measure theory, lets me build a sieve-like structure for the prior. The main statistical assumption concerns the compatibility of the prior and the data-generating process, bounding the fluctuations in the log-likelihood when averaged over the sieve-like sets. In addition to posterior convergence, I derive a kind of large deviations principle for the posterior measure, extending in some cases to rates of convergence, and discuss the advantages of predicting using a combination of models known to be wrong.
(More on this paper)

I'll also be lecturing about prediction, self-organization and filtering to the BCCS students.

I presume that I will not spend the whole week talking about statistics, or working on the next round of papers and lectures; is there, I don't know, someplace in Bristol to hear music or something?

Self-centered; Enigmas of Chance; Complexity; Minds, Brains, and Neurons

Posted by crshalizi at February 04, 2010 13:48 | permanent link

January 31, 2010

Books to Read While the Algae Grow in Your Fur, January 2010

Virginia Swift, Hello, Stranger
Enjoyable mystery with eccentric academics, God-botherers and gentrification in present-day Laramie. Nth book in a series; I'll keep an eye out for the others.
Intelligence
Smart crime/spook drama set in one of the most attractive cities in the world (Vancouver), which could only be improved if it didn't end in the WORST CLIFFHANGER EVER. (Ahem.) Not, of course, as good as The Wire, but then nothing is.
Daniel Waley, The Italian City-Republics
Short, readable political-institutional history of the communes of northern and central Italy. He begins with the communes starting to take form in the towns and wrest control from their bishops, say around 1000, and ends by about 1400, by which point the towns had almost all, except for Venice, descended into some form of monarchy, generally under the domination of the local feudal land/war-lords. (Waley says little about Venice, which in retrospect seems odd, though it didn't strike me while reading it.) While Waley is good at describing this historical trajectory, he says little about why so many Italian cities followed it. I'd think it'd be natural to compare the Italian case to contemporary cities elsewhere, but I think there is exactly one sentence on them. (I imagine all kinds of interesting comparative work could be or has been done.) But within those limits, it's a nice book. Waley has also written studies on Siena and Orvieto, which sound interesting.
Terry Pratchett, Nation
You don't really need me to recommend Terry Pratchett to you, especially when he's writing about how people find ways to go on when their world has been pointlessly destroyed.
Richard Hofstadter, Anti-Intellectualism in American Life
Astonishingly, this still feels like it fits after a lapse of half a century. The whole "tax-raising, latte-drinking, sushi-eating, Volvo-driving, New-York-Times-reading, body-piercing, Hollywood-loving, left-wing freak-show" nonsense of the last thirty years now makes a lot more sense; and the chapters about the history of American education were frankly a revelation to me. (The chapter on Dewey and his pedagogical influence seems like a model of being respectfully but unrelentingly critical.) No doubt for real historians, this is all painfully outdated, and whatever's actually sound has long since been incorporated into other works, which don't provide such unintentional moments of amusement as listing, among the unfair accusations heaped on Jefferson, keeping a slave mistress and having children by her. (For that matter I don't care for the Beats very much, but they certainly contributed more to our literature than he thought they would.) Still: the man could write.
ObLinkage: Steve Laniel on AIiAL.
D. N. MacKenzie (trans.), Poems from the Divan of Khushâl Khân Khattak
The first significant body of poetry in Pashto; Khushal was a 17th century warlord in what is now the Northwest Frontier, owing his position to a combination of tribal authority and appointment by the Mughals. This seems to be the most recent translation of a selection from his poetry in English, dating from 1965. It is arranged on no particular principles (some Pashto editions are, following tradition, arranged alphabetically by the first letter of the poem), which produces a rather odd effect, that I might summarize as follows: Khushal is happily in love: wow is the beloved a hottie. Khushal is unhappily in love: separation is awful, especially if it's because the beloved doesn't want to see Khushal. Khushal is a fierce warrior who is also a keen hunter; falconry rules. Khushal has a remarkable capacity for drink. (Go ahead, try and tell me that's allegorical.) Aurangzeb sucks, especially in comparison to his father. (Well, he did, and sticking Khushal in jail can't have won him any points.) The Afghans should rally to Khushal and defeat Aurangzeb! Men are treacherous, false-faced bastards, but Afghans are really worse than the rest. (To be fair, having one of your own sons wage war on you in the name of Aurangzeb has got to be pretty embittering.) Khushal will withdraw from the sinful world and spend his days in pious penance. Khushal glorifies God. Repeat.
My grandfather's extemporized translations were better English poetry, but I will never hear those again.
Moez Draief and Laurent Massoulié, Epidemics and Rumors in Complex Networks
A nice short (< 120 pp.) account of the connections among stochastic network models, branching processes, and epidemic models, of the "susceptible-infectious-susceptible" or "susceptible-infectious-recovered" type, including epidemics on networks. ("Rumors" are assumed to fall under such models.)
They begin with the basic Galton-Watson branching process model, where each member of a population produces a random number of descendants (possibly zero), independently of everyone else, and this distribution is constant both within and across generations. Following over a century of tradition, they look at whether the population survives forever or goes extinct, how large it gets, how long it takes to go extinct if it does, etc. This then gets turned into a simple epidemic model ("member of population" = infected individual). It also maps on to the Erdős-Rényi network model, with "has an edge with" taking the place of "is a descendant of": pick your favorite node, and connect it to a random selection of other nodes, the number following a binomial distribution; connect each of them in turn to more random nodes. The size of the branching process's population corresponds to the size of the connected component in the graph. The mapping really only works in the limit of low-density graphs (the size of the component is roughly a sum of independent quantities when there are no loops), but it's enough to study the emergence of a giant component and the behavior of the diameter of the graph. As a prelude to more sophisticated models, they then prove a form of Kurtz's Theorem on the convergence of Markov chains to ordinary differential equations in the large-population limit. The second half of the book rehearses Watts-Strogatz small-world and Barabási-Albert scale-free networks (including mention of Yule but not, oddly, of Herbert Simon), before wrapping up with epidemic models on graphs, and the "viral marketing" problem of deciding where, on a known and fixed network, to start an epidemic for maximum impact.
Of course, since it's a mathematics book, the problem of how to link these models to data isn't even dismissed.
This isn't a ground-breaking work, but it's nice to have all this in a single book, and one a bit more accessible than, say, Durrett's Random Graph Dynamics (though by the same token less comprehensive). The implied reader is comfortable with stochastic processes at the level of something like Grimmett and Stirzaker; measure-theoretic issues are avoided, even when discussing Kurtz's Theorem. (Their version is thus much less precise and powerful than his, but vastly easier to understand.) Anyone comfortable with that level of probability could read it without much trouble, and I'd happily use it in a class.
Disclaimer: I read a draft of the manuscript for the publisher in 2007, and they sent me a free copy of the book, but I have no stake in its success.
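For readers who want to poke at the Galton-Watson process described in the review above, here is a minimal simulation sketch. It is mine, not the book's: the binomial offspring distribution is chosen to echo the Erdős-Rényi mapping, and the function names, the population cap standing in for "survives forever", and the parameter values are all assumptions made for illustration.

```python
import random

def galton_watson_total(n_trials, p, rng, cap=1_000):
    """Total number of individuals ever born, starting from one ancestor.

    Each individual independently has Binomial(n_trials, p) children.
    Returns None once the total exceeds `cap`, which this sketch treats
    as a stand-in for "the population survives forever".
    """
    alive, total = 1, 1
    while alive > 0:
        # A sum of `alive` independent Binomial(n_trials, p) draws is
        # Binomial(alive * n_trials, p), so count successes directly.
        children = sum(1 for _ in range(alive * n_trials) if rng.random() < p)
        total += children
        if total > cap:
            return None
        alive = children
    return total

def survival_fraction(n_trials, p, runs=500, seed=0):
    """Fraction of simulated populations that blow past the cap."""
    rng = random.Random(seed)
    survived = sum(
        galton_watson_total(n_trials, p, rng) is None for _ in range(runs)
    )
    return survived / runs

if __name__ == "__main__":
    # Mean offspring 0.5 (subcritical) versus 2.0 (supercritical).
    print("subcritical:  ", survival_fraction(4, 0.125))
    print("supercritical:", survival_fraction(4, 0.5))
```

With mean offspring below one, essentially every run dies out quickly; above one, a positive fraction of runs keeps growing, which is the branching-process counterpart of a giant component appearing in the random graph.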
Joseph L. Graves, Jr., The Emperor's New Clothes: Biological Theories of Race at the Millennium
There are places where he lapses into biological jargon, and others where I think lay readers would have benefited from more detailed rebuttals of the common counter-arguments, but over-all I recommend this very strongly. (Thanks to I.B. for lending me her copy.)
Pascal Massart, Concentration Inequalities and Model Selection
Using empirical process theory, and more specifically concentration of measure, to get finite-sample, i.e., non-asymptotic, risk bounds for various forms of model selection. The basic strategy is to find conditions under which every model in a reasonable class will, with high probability, perform about as well on sample data as they can be expected to do on new data; this involves constraining the richness or flexibility of the model class. A little extra work, and the addition of suitable penalties to the fit, gets bounds that extend over multiple classes of model, even over a countable infinity of classes. Among other highlights, Massart shows why the famous AIC heuristic is often definitely sub-optimal, and how to correct it; it also offers corrections to Vapnik's (much better) structural risk minimization, and a nice treatment of data-set splitting (= 1-fold cross-validation). All of this is for IID data, so the usual caveats apply. Formally self-contained, but realistically some previous exposure to empirical processes (at the level of Pollard's notes if not higher) will be needed. Available for free as a large PDF preprint, but I found it much more convenient to read a dead-tree copy.
Elizabeth Bear, New Amsterdam
Alternate-history fantasy mystery stories. Owing something, perhaps, to Randall Garrett's "Lord Darcy" stories (the name of the heroine is distinctly suspicious), but without their complacency about the benevolence of the powers that be.
David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining
I've used this three times now in teaching 36-350, with about 75 students total over the years. I keep using it because it's the best textbook on data-mining I know. It covers the whole process, soup to nuts: data collection (and the importance of understanding what the data actually mean, if anything), cleaning, databases, model construction, model evaluation, optimization, visualization, etc. All of this is organized around four crucial questions: what kind of pattern are we looking for in the data, and how do we represent those patterns? how do we score representations against each other? how do we search for good representations? what do we need to do to implement that search efficiently? All of the basic methods (and many not so basic ones) are in here, all seen as different answers to these questions. I find its explanations extremely clear, and my students seem to as well. I regard it as a strength that it is not tied to pre-canned software, which would only encourage dependency and thoughtlessness.
The only real competition, to my mind, is Hastie, Tibshirani and Friedman. But the Stanford book is distinctly more about statistics, and has more statistical theory and math (though not, from my point of view, a lot of either), whereas this one is distinctly focused on data-mining and on computation. It would be nice if Hand &c. had material on support vector machines, and more on ensemble methods; perhaps it's time for a second edition?
Disclaimer: I almost took a post-doc under Smyth rather than coming to CMU, back in 2004; also, the MIT Press sent me a free review copy of this book (in 2001).

Books to Read While the Algae Grow in Your Fur; Pleasures of Detection, Portraits of Crime; Enigmas of Chance; Scientifiction and Fantastica; Writing for Antiquity; Afghanistan and Central Asia; The Natural Science of the Human Species; Networks; The Beloved Republic; The Commonwealth of Letters; Learned Folly

Posted by crshalizi at January 31, 2010 23:59 | permanent link

January 19, 2010

The Work of Art in the Age of Mechanical Reproduction

Attention conservation notice: 800+ words of inconclusive art/technological/economic-historical musings.

This thread over at Unfogged reminds me of something that's puzzled me for years, ever since reading this: why didn't prints displace paintings the same way that printed books displaced manuscript codices? Why didn't it become expected that visual artists, like writers, would primarily produce works for reproduction? (No doubt, in that branch of the wave-function*, obsessive fans still want to get the original drawings, but obsessive fans also collect writers' manuscripts, or even their typewriters, as well as their mass-produced books.) 16th century engraving technology was strong enough that it could implement powerful works of art (vide), so that can't be it. And by the 18th century at least writers could make a living (however precarious) from writing for the mass public, so why didn't visual artists (for the most part) do likewise? (Again, it's manifestly not as though technology has regressed.) Why is it still the case that a real, high-class visual artist is someone who makes one-offs? I know that reproductions have been important since at least the late 1800s, but for works and artists who first made their reputation with unique, hand-made objects, which is as though the only books which got sent to the printing press were ones which had already circulated to acclaim in manuscript.

Some possibilities I don't buy:

  1. Aesthetic limitations. There are valuable effects which can be achieved with a big original painting which prints just can't match. Response: there are effects you can achieve with an illuminated, calligraphic manuscript which you can't match with movable type, either. Those weren't valuable enough to keep printed books from taking over. Why the difference? Why not a focus on what can be done through prints, which is quite a lot? (Witness the experience of the 20th century and later, when most art lovers know most works of art they enjoy through reproductions.)
  2. Color. A real limitation; even today, getting color done well in mass visual media is not entirely trivial (cf.), and early modern Europe certainly couldn't do it at all. Response: What makes color so important? We know that some great art was made without its benefit, and we don't really know how much better it could have gotten had prints been the medium of choice. Even if color was all that, it just pushes the shift to the late 19th century.
  3. Artists too expensive. Whether you are producing one painting or a thousand prints, there is a considerable fixed cost to the artist's time and training. (The first print is very expensive.) Individual patrons could afford this; the mass public could not. Response: The same argument would apply to books. Besides, high fixed costs usually drive towards seeking a wider market, so that the fixed costs are distributed over a larger number of people. The argument would have to be one of failure of demand — that where there was one man willing to pay 100 guilders (or whatever) for a painting, there were not, say, 120 people willing to pay 1 guilder for prints. Why not?
  4. Paintings too cheap. There have always been too many people wanting to be visual artists for them to all make a living as original artists. One of the things they could do instead was paint copies. Response: The economy of scale problem still applies.
  5. States too weak. In a competitive market, market prices equal marginal costs. The marginal cost of producing another copy of a print is very, very low, so low that the fixed costs of drawing and designing it in the first place aren't recouped. As usual, then, competitive markets fail massively at producing informational goods. The modern solution is to institute and vigorously enforce intellectual property rights. These are monopoly privileges which the state grants to certain individuals; if anyone tries to compete with these favorites of the powers that be, then "goons with guns" (as my libertarian friends like to say) come to stop them. Doing this requires a really massively powerful and intrusive state, which is a relatively recent phenomenon, and not to be lightly deployed on behalf of artists, of all people. Artists who tried to go the mass-production route would've been even more starvation-prone than those who didn't attempt it. Response: An exactly parallel argument would explain why writers didn't embrace printing.
  6. The revolution has happened. The overwhelming majority of visual artists do aim their work at reproduction; it's just a small minority which continues to produce one-offs. This minority has, however, a lot more cultural prestige. Response: There's some merit to this, but it's bizarre and anomalous; it's not as though our really high-class literature was still illuminated or calligraphic manuscripts, and printing was reserved for déclassé "commercial" work.
The most convincing argument I've been able to come up with has to do with how visual artworks were and are used. Even in manuscript, books were for reading: private consumption, or near enough. European culture, however, provided a steady stream of demand for works of visual art for public display, which is rather different. If it were just a matter of pictures you'd like to look at for your own enjoyment, perhaps prints would serve. But if it's about decorating the church/guildhall/imposing estate, then you need a unique painting of St. Jerome/the burgomasters/the master of the house. The main point is that the owner has the resources to command their very own artwork, not the work's intrinsic aesthetic properties (which good reproductions would share). But even then, why not develop a second stream of reproducible artwork for private rather than conspicuous consumption? And indeed why not try to achieve similar effects in print, thereby broadcasting the message?

Updates, 31 January 2010: In correspondence, Elihu Gerson points to an interesting-looking book relevant to the social-use explanation.

Also, it seems I should clarify that I am not asking why (as Vukutu puts it) "people desire original works of visual art rather than printed reproductions". If you are going to paint in oils on canvas, then of course making a flat print of the result is going to lose some detail of the physical object, and those details might contribute in important ways to people's experience of the object; there might be a real esthetic loss to looking at a reproduction of a painting. What I am asking is why then we do not produce artworks which are designed for reproduction. Or rather, we do produce lots of such art, but it's not seen as very valuable, and generally not even real art in the honorific sense. "Printed reproductions of physical paintings lose valuable details" does not answer "Why did our visual arts continue to focus on making one-off works?", unless perhaps you add some extra premises, like (i) no print-reproducible image could be as esthetically valuable as a three-dimensional painting, and (ii) that difference in intrinsic quality was extremely important to the people who consumed art, and I am very dubious about both of these.

Finally, I don't think it's sufficient to point to "tradition", since traditions change all the time. That deserves another argument, but another time. In lieu of which, I'll just offer a quotation from a favorite book, Joseph (Abu Thomas) Levenson's Confucian China and Its Modern Fate; he is writing about ideas, but as he makes clear, what he says applies just as much to aesthetic or practical choices as to intellectual ones.

With the passing of time, ideas change. This statement is ambiguous, and less banal than it seems. It refers to thinkers in a given society, and it refers to thought. With the former shade of meaning, it seems almost a truism: men may change their minds or, at the very least, make a change from the mind of their fathers. Ideas at last lose currency, and new ideas achieve it. If we see an iconoclastic Chinese rejection, in the nineteenth and twentieth centuries, of traditional Chinese beliefs, we say that we see ideas changing.

But an idea changes not only when some thinkers believe it to be outworn but when other thinkers continue to hold it. An idea changes in its persistence as well as in its rejection, changes "in itself" and not merely in its appeal to the mind. While iconoclasts relegate traditional ideas to the past, traditionalists, at the same time, transform traditional ideas in the present.

This apparently paradoxical transformation-with-preservation of a traditional idea arises from a change in its world, a change in the thinker's alternatives. For (in a Taoist manner of speaking) a thought includes what its thinker eliminates; an idea has its particular quality from the fact that other ideas, expressed in other quarters, are demonstrably alternatives. An idea is always grasped in relative association, never in absolute isolation, and no idea, in history, keeps a changeless self-identity. An audience which appreciates that Mozart is not Wagner will never hear the eighteenth-century Don Giovanni. The mind of a nostalgic European medievalist, though it may follow its model in the most intimate, accurate detail, is scarcely the mirror of a medieval mind; there is sophisticated protest where simple affirmation is meant to be. And a harried Chinese Confucianist among modern Chinese iconoclasts, however scrupulously he respects the past and conforms to the letter of tradition, has left his complacent Confucian ancestors hopelessly far behind him...

An idea, then, is a denial of alternatives and an answer to a question. What a man really means cannot be gathered solely from what he asserts; what he asks and what other men assert invest his ideas with meaning. In no idea does meaning simply inhere, governed only by its degree of correspondence with some unchanging objective reality, without regard to the problems of its thinker. [pp. xxvii--xxviii; for context, this passage was first published in 1958]

*: With apologies to the blogger formerly known as "the blogger formerly known as 'The Statistical Mechanic' ".

Manual trackback: Mostly Hoofless; 3 Quarks Daily; Cliopatria (!); Vukutu.

Writing for Antiquity

Posted by crshalizi at January 19, 2010 22:01 | permanent link

December 31, 2009

Books to Read While the Algae Grow in Your Fur, December 2009

Duplicity
It'd be a spoiler simply to count the number of layers of trickery here; and it's romantic; what more could you want? (Recommended by Kate Nepveu.)
Burrowers
Creepy, grim little western horror movie. The ecology almost makes sense even. (No purchase link just because Powell's doesn't seem to sell it.)
Nick Abadzis, Laika
The story of Korolev, the Soviet space program, and of course the eponymous heroine, the first terrestrial creature in space. I kept muttering "The dog dies at the end", but by the end it mattered to me that the dog died.
Seamus Cooper, The Mall of Cthulhu
Yes, it's about an ancient alien squid-god trying to destroy Life As We Know It via a shopping mall in suburban New England, with all the usual indescribable horrors, and lots of joking references to previous works in the genre ("Ms. Harker"!). But also: a convincingly unidealized yet affecting friendship. — Apparently there will be a sequel; I will read it very eagerly.
John Layman and Rob Guillory, Chew: Taster's Choice
Unquestionably the finest, and grossest, detective story about food and black-market poultry ever, at least among those executed as comic books.
Susan Hough, Predicting the Unpredictable: The Tumultuous Science of Earthquake Prediction
Full (positive) review coming later for a magazine.
Philip Kitcher, In Mendel's Mirror: Philosophical Reflections on Biology
Collection of Kitcher's papers about the philosophy of biology and related issues, mostly tied (as the subtitle suggests) to genetics. The most interesting paper for me was "Developmental Decomposition and the Future of Human Behavioral Ecology" [JSTOR], about what'd be involved in doing something like evolutionary psychology properly. (I should warn readers of that chapter — Kitcher doesn't do so properly — that he takes as his case study the explanation of incest avoidance, which leads him into a detailed examination of the situations where incest is not avoided. There are sound reasons for this, but it's not for the squeamish or, I'd imagine, the traumatized.) Those who like this sort of thing will find it to be just the sort of thing they like.
Sarah Graves, The Book of Old Houses
Astonishingly, I have yet to experience series fatigue after eleven books. The little bits of Lovecraftian atmosphere added here are, thankfully, debunked inside the story. (Previously: 1--4, 5, 6--10)
Rosemary Kirstein, The Language of Power
The continuation of what is at once an epic fantasy full of marvels and an inspiring depiction of the life of the mind. The scene with Rowan, Will, and the pair of invisible dragons will linger in my memory, and I found myself pleased and astonished at Kirstein's depiction of the sheer beyond-all-experience strangeness of magic. (There, I think I have avoided spoilers for once.)
My only complaint: where is the rest of the series?!?? I want them now!!!!
A. R. Luria, Cognitive Development: Its Social and Cultural Foundations
Re-read after a lapse of ten years. I still think it's a fascinating and profound, though also flawed, work; the successes now loom larger for me than the flaws, though the latter are very real. (To recapitulate what I've written elsewhere: Uzbekistani peasants in the 1930s had excellent reasons to play dumb when Soviet officials came around asking bizarre and leading questions, especially about foreign countries, or premised on obvious falsehoods.) Two things which now impress me more: first, the stuff about visual illusions and colors; and, second, the demonstration that the subjects could solve more concrete problems which were formally identical to the ones they couldn't, or wouldn't, solve in abstract or contrary-to-fact form.
Hans Reichenbach, The Direction of Time
One of the greatest of the logical positivists takes a whirl at reconciling time-reversible microscopic physics with irreversible macroscopic processes in 1956. I began reading this a long time ago, then bogged down in the last chapter, on quantum statistical mechanics; I took the occasion of a long plane flight to re-read and finish it, and am very glad I did. The discussion of relativity, thermodynamics and ergodic theory is clear and sound, if not — at least now — ground-breaking. (It seems extremely odd that general relativity is so ignored; but perhaps just as well, since cosmology was about to be revolutionized.) One highlight for me was the idea of "branch systems", and using the consistency of arrows of time across nearly-isolated mixing processes (not called that) to construct a more global arrow of time. Even the chapter on quantum effects was more interesting than I thought it would be, being mostly concerned with the identity (or lack thereof) of quantum particles through time, though I think the treatment in Teller is superior.
The most fascinating part of the book for me, however, is Reichenbach's efforts to build up a notion of time which has not just an order but a direction from causal relations. (If we pick an axis in space, as he says, it has two equally good orders, say left-to-right or right-to-left; time is not just ordered but directed, past-to-future.) He develops in considerable detail the theme that edges in the causal network of spatio-temporal events can be oriented based on the principle that dependent events become independent conditional on their common causes. This is incredibly close to modern ideas about inferring the structure of causal graphical models (see Spirtes, Glymour and Scheines below; Glymour studied under two of Reichenbach's students, Wesley Salmon and Cynthia Schuster). Sadly, I would almost say tragically, however, Reichenbach makes the crucial mistake of thinking that the same sort of independence can easily happen conditional on common effects, when actually it almost never does. (My marginal note at this point is, I see, "NOOOO!") Arguably, this delayed the development of causal inference for decades.
Reichenbach was drawing on many different areas of physics and mathematics which have all made a lot of progress in the last half-century, so I am a bit uneasy about recommending it unreservedly to non-specialists. (There is a new book I can recommend, unreservedly and even reversibly, to general readers.) But the core ideas are very much right, and it's still an imposing and inspiring piece of work.
ObLinkage 1: Emerson on Reichenbach on time.
ObLinkage 2: Speaking of Bérubé (as I was, parenthetically), Steven Gimbel's "If I Had A Hammer: Why Logical Positivism Better Accounts for the Need for Gender and Cultural Studies" tries to appropriate Reichenbach, and logical positivism more generally, for the forces of political correctness.
L. G. Godfrey, Misspecification Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches
Lots and lots about checking for whether you have the wrong terms in your parametric (especially linear-and-Gaussian) model. Less fundamental than the approaches of White or Hart, and also better adapted to the background and habits typical of econometricians. (This is no accident.)
(Oh, the Lagrange multiplier principle? Suppose your model imposes some restrictions on the parameters, as compared to some larger model you can embed it in. Imagine estimating your model by doing a constrained maximization of the likelihood in the larger model; how big does the Lagrange multiplier on your constraints have to be? How much are you paying in likelihood, in other words, to enforce the constraint? If your model is true, then for large samples the cost is very small and the Lagrange multiplier tends towards zero.)
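To make the idea concrete, here is a minimal sketch under assumptions that are mine and not Godfrey's: the larger model is Gaussian with unknown mean mu and known unit variance, the restricted model imposes mu = 0, and the function name is hypothetical. In this toy case the multiplier on the constraint equals the score (the derivative of the log-likelihood) at mu = 0; per observation it shrinks towards zero when the restriction is true, and the usual test statistic is the squared score divided by the Fisher information.

```python
import random


def lagrange_multiplier_test(xs):
    """Score (LM) test of mu = 0 in a N(mu, 1) model (illustrative assumption).

    Maximizing the log-likelihood -sum((x - mu)^2)/2 subject to mu = 0,
    the stationarity condition of the Lagrangian gives
    multiplier = d loglik / d mu at mu = 0 = sum(x).
    """
    n = len(xs)
    score = sum(xs)                 # Lagrange multiplier on the constraint
    multiplier_per_obs = score / n  # this is what tends to zero under the null
    lm_statistic = score ** 2 / n   # score^2 / Fisher information (which is n)
    return multiplier_per_obs, lm_statistic


rng = random.Random(1)
n = 10_000
null_data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # restriction is true
alt_data = [rng.gauss(0.2, 1.0) for _ in range(n)]   # restriction is false

null_multiplier, null_stat = lagrange_multiplier_test(null_data)
alt_multiplier, alt_stat = lagrange_multiplier_test(alt_data)

print("restriction true: ", null_multiplier, null_stat)
print("restriction false:", alt_multiplier, alt_stat)
```

Under the restriction the statistic is approximately chi-squared with one degree of freedom, so it would be compared to a critical value such as 3.84 at the 5% level; when the restriction is false the statistic grows in proportion to the sample size.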
Warren Ellis and Paul Duffield, Freakangels, vol. 3
In which we consider various forms of rebuilding.
Peter Spirtes, Clark Glymour and Richard Scheines, Causation, Prediction and Search
Re-read as part of preparing for my lecture on causal discovery. I spent much of the winter of 2000 working my way through the first edition, and wound up completely imprinted on its way of thinking about what causal relationships are, how we should reason about them, and how we can find them from empirical evidence. On causation and prediction it now has an equal in Pearl's book (and I admit the latter looks prettier), but on search, that is, on discovering causal structure, there is still no rival. Their key observation is that even though correlation does not imply causation, correlations must have causal explanations. (This idea goes back to Herbert Simon, and Hans Reichenbach [see above] at least.) So patterns of correlations, among more than just two variables, constrain what causal structures are possible. Sometimes they constrain the causal structure uniquely, in other cases it's only partially identified by the dependencies. And of course there is always the possibility of making a mistake with limited data. But none of this is any different for causal discovery than it is for any other form of statistical inference. The great contribution of this book is showing that causal discovery can be just another learning problem. They have transformed metaphysical misery into ordinary statistical unhappiness.
(I can't resist illustrating, though it's necessarily a bit involved. Take three variables, call them X, Y and Z. We find that there is a correlation between X and Y which we can't make go away, no matter what we control for, and likewise between Y and Z, but not between X and Z. There are four possibilities compatible with this: the causal chain X->Y->Z; the opposite causal chain from Z to X; a "fork" where Y is the common cause, X<-Y->Z; and a "collider" or "conjunction" where Y is the common effect, X->Y<-Z. In the first three cases, Y "screens off" X from Z: those variables are independent of each other, conditional on Y. So the absence of conditional independence definitely tells us which way the causal links point. In fact, conditional independence at a collider, while mathematically possible, requires no-margin-for-error adjustment of the parameters, so if we assume that such conspiracies are absent ["faithfulness"], we have conditional dependence if and only if there's a collider, which gives us the direction of causation from correlations. "Orienting" some correlations in this manner induces orientations in others, distinguishing forks and the two kinds of chain. For more, see the aforementioned lecture notes, or indeed this book.)
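The contrast between a chain and a collider can be checked in a small simulation. This is only a sketch under assumptions of my own choosing: linear relations with unit-variance Gaussian noise, so that a vanishing partial correlation stands in for conditional independence; the function names are hypothetical and nothing here is the search algorithm from the book.

```python
import math
import random


def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def partial_corr(a, b, c):
    """Correlation of a and b after linearly controlling for c."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def simulate(kind, n, rng):
    """Draw n samples of (X, Y, Z) from a chain or a collider."""
    xs, ys, zs = [], [], []
    for _ in range(n):
        if kind == "chain":        # X -> Y -> Z
            x = rng.gauss(0, 1)
            y = x + rng.gauss(0, 1)
            z = y + rng.gauss(0, 1)
        elif kind == "collider":   # X -> Y <- Z
            x = rng.gauss(0, 1)
            z = rng.gauss(0, 1)
            y = x + z + rng.gauss(0, 1)
        else:
            raise ValueError(kind)
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


rng = random.Random(0)
results = {}
for kind in ("chain", "collider"):
    xs, ys, zs = simulate(kind, 20_000, rng)
    results[kind] = {
        "marginal": corr(xs, zs),           # X and Z, controlling for nothing
        "partial": partial_corr(xs, zs, ys),  # X and Z, controlling for Y
    }
    print(kind, results[kind])
```

In the chain, X and Z are correlated until Y is controlled for; in the collider they are uncorrelated until Y is controlled for, at which point a dependence appears. That asymmetry is what lets the edges be oriented.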
Disclaimers: All three authors have appointments in the CMU Machine Learning department which I'm also affiliated with, etc., etc. And the MIT Press sent me a free copy for review in 2001. (There is a reason my totem is a sloth, yes?)
Nunzio DeFilippis, Christina Weir, Brian Hurtt and Arthur Dela Cruz, Skinwalker
Starts off as a procedural psycho-killer-in-Indian-country mystery, and then gets... strange.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Enigmas of Chance; Philosophy; Physics; The Great Transformation; Minds, Brains, and Neurons; Afghanistan and Central Asia; The Progressive Forces; Biology; The Eternal Silence of These Infinite Spaces; Cthulhiana; The Dismal Science

Posted by crshalizi at December 31, 2009 23:59 | permanent link

Output Summary

After long, long journeys, in one case going back to 2003, some papers have come out. Alphabetically by distinguished co-authors:

Aaron Clauset, CRS, and M. E. J. Newman, "Power-law distributions in empirical data", arxiv:0706.1062 = SIAM Review 51 (2009): 661--703
I wrote about this when we first submitted it. In the intervening two and a half years, many people have continued to make the baby Gauss cry by publishing, and publicizing, supposed power laws based on completely inadequate and unreliable methods. Because their methods are unsound, one has no idea whether they're right or not, short of re-analyzing the data properly. I sometimes imagine these authors singing
I could be right
I could be wrong
I feel nice when I sing this song
but many of them at least pretend to care about the truth of their claims, so I piously hope that in the fullness of time the community of inquirers will come around to using reliable methods. In which regard I am gratified, but also astonished, to see that this is already the most-cited paper I've contributed to, by such a large margin that it's unlikely anything else I do will ever rival it.
See also: Aaron.
Rob Haslinger, Kristina Klinkner and CRS, "The Computational Structure of Spike Trains", arxiv:1001.0036 = Neural Computation 22 (2010): 121--157
I haven't written about this one before, though I feel free to do so now that we're published. This was a fun venture into applying state-reconstruction ideas, specifically CSSR, to neural spike trains, specifically from the barrel cortex of the rat, which represents sensory input from the whiskers. (The experimentalists build special whisker-vibrating machines, which are actually quite impressive.) We do, I think, a pretty good job of predicting the spike trains in an entirely non-parametric way, and showing how their complexity is modulated by sensory stimuli, that is, how much tweaking the whisker drives the cortical neuron.
CRS, "Dynamics of Bayesian Updating with Dependent Data and Misspecified Models", arxiv:0901.1342 = Electronic Journal of Statistics 3 (2009): 1039--1074
I also wrote about this when I first submitted it. I'm particularly grateful to one of the reviewers, who read the paper very carefully, totally got it, and provided many helpful suggestions, one of which grew into a new theorem on rates of convergence. Thank you, benevolent and thoughtful anonymous referee person! Also, the publication process at EJS was extremely fast and utterly painless.

Other output: my first hemi-demi-semi-co-supervised student graduating with his doctorate (a fine piece of work I wish I could link to); a paper draft finished and sitting on a collaborator's desk (no pressure!); the homophily paper is almost finished (I need to speed up some simulations and cut out most of the jokes); half-a-dozen referee reports of my own (a deliberate new low; made easier by boycotting Elsevier); five papers edited for Annals of Applied Statistics (a new high); nine lectures newly written or massively revised for 36-350; all the problem sets for 350 re-worked and much better; three books reviewed for American Scientist (and a whole bunch of mini-reviews for nowhere in particular).

On the other hand, no chapters finished for Statistical Analysis of Complex Systems; three very patient collaborators in different parts of Manhattan waiting for me to turn things around; one superhumanly patient collaborator in Santa Fe ditto; and one project which has been accreting since 2007 really needs to be cut and polished into some papers. Resolution for next year: more papers.

Self-Centered; Enigmas of Chance; Complexity; Power Laws; Minds, Brains, and Neurons

Posted by crshalizi at December 31, 2009 18:45 | permanent link

December 28, 2009

Significance, Power, and the Will to Believe

Attention conservation notice: 2100 words on parallels between statistical hypothesis testing and Jamesian pragmatism; an idea I've been toying with for a decade without producing anything decisive or practical. Contains algebraic symbols and long quotations from ancient academic papers. Also some history-of-ideas speculation by someone who is not a historian.

When last we saw the Neyman-Pearson lemma, we were looking at how to tell whether a data set x was signal or noise, assuming that we know the statistical distributions of noise (call it p) and the distribution of signals (q). There are two kinds of mistake we can make here: a false alarm, saying "signal" when x is really noise, and a miss, saying "noise" when x is really signal. What Neyman and Pearson showed is that if we fix on a false alarm rate we can live with (a probability of mistaking noise for signal; the "significance level"), there is a unique optimal test which minimizes the probability of misses --- which maximizes the power to detect signal when it is present. This is the likelihood ratio test, where we say "signal" if and only if q(x)/p(x) exceeds a certain threshold picked to control the false alarm rate.
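As a concrete instance of the likelihood ratio test, here is a minimal sketch in which the distributions are my own illustrative assumption: a single observation x, with noise p = N(0, 1) and signal q = N(1, 1). In that case q(x)/p(x) = exp(x - 1/2) is increasing in x, so thresholding the ratio is the same as thresholding x itself, and the threshold can be read off the noise distribution.

```python
import math
from statistics import NormalDist

noise = NormalDist(0.0, 1.0)    # p: distribution of x when there is only noise
signal = NormalDist(1.0, 1.0)   # q: distribution of x when signal is present
alpha = 0.05                    # false alarm rate we can live with


def likelihood_ratio(x):
    """q(x) / p(x) for a single observation."""
    return signal.pdf(x) / noise.pdf(x)


# The ratio is increasing in x here, so {ratio > k} is the same event as {x > c}.
c = noise.inv_cdf(1 - alpha)    # cut-off on x giving false alarm rate alpha
k = likelihood_ratio(c)         # the equivalent threshold on the ratio

false_alarm = 1 - noise.cdf(c)  # Pr(say "signal" | noise)
power = 1 - signal.cdf(c)       # Pr(say "signal" | signal) = 1 - miss rate


def decide(x):
    """Say "signal" if and only if the likelihood ratio exceeds the threshold."""
    return "signal" if likelihood_ratio(x) > k else "noise"


print("threshold on x:", c, "threshold on ratio:", k)
print("false alarm rate:", false_alarm, "power:", power)
```

With these particular assumed distributions the power at a 5% false alarm rate is only about a quarter, which shows how costly a strict significance level can be when signal and noise overlap heavily.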

The Neyman-Pearson lemma comes from their 1933 paper; but the distinction between the two kinds of errors is clearly more fundamental. Where does it come from?

The first place Neyman and/or Pearson use it, that I can see, is their 1928 paper (in two parts), where it's introduced early and without any fanfare. I'll quote it, but with some violence to their notation, and omitting footnoted asides (from p. 177 of part I; "Hypothesis A" is what I'm calling "noise"):

Setting aside the possibility that the sampling has not been random or that the population has changed during its course, x must either have been drawn randomly from p or from q, where the latter is some other population which may have any one of an infinite variety of forms differing only slightly or very greatly from p. The nature of the problem is such that it is impossible to find criteria which will distinguish exactly between these alternatives, and whatever method we adopt two sources of error must arise:
  1. Sometimes, when Hypothesis A is rejected, x will in fact have been drawn from p.
  2. More often, in accepting Hypothesis A, x will really have been drawn from q.

In the long run of statistical experience the frequency of the first source of error (or in a single instance its probability) can be controlled by choosing as a discriminating contour, one outside which the frequency of occurrence of samples from p is very small — say, 5 in 100 or 5 in 1000. In the density space such a contour will include almost the whole weight of the field. Clearly there will be an infinite variety of systems from which it is possible to choose a contour satisfying such a condition....

The second source of error is more difficult to control, but if wrong judgments cannot be avoided, their seriousness will at any rate be diminished if on the whole Hypothesis A is wrongly accepted only in cases where the true sampled population, q, differs but slightly from p.

The 1928 paper goes on to say that, intuitively, it stands to reason that the likelihood ratio is the right way to accomplish this. The point of the 1933 paper is to more rigorously justify the use of the likelihood ratio (hence the famous "lemma", which is really not set off as a separate lemma...). Before unleashing the calculus of variations, however, they warm up with some more justification (pp. 295--296 of their 1933):
Let us now for a moment consider the form in which judgments are made in practical experience. We may accept or we may reject a hypothesis with varying degrees of confidence; or we may decide to remain in doubt. But whatever conclusion is reached the following position must be recognized. If we reject H0, we may reject it when it is true; if we accept H0, we may be accepting it when it is false, that is to say, when really some alternative Ht is true. These two sources of error can rarely be eliminated completely; in some cases it will be more important to avoid the first, in others the second. We are reminded of the old problem considered by LAPLACE of the number of votes in a court of judges that should be needed to convict a prisoner. Is it more serious to convict an innocent man or to acquit a guilty? That will depend upon the consequences of the error; is the punishment death or fine; what is the danger to the community of released criminals; what are the current ethical views on punishment? From the point of view of mathematical theory all that we can do is to show how the risk of the errors may be controlled and minimised. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator.
(Neither Laplace nor LAPLACE is mentioned in their 1928 paper.)

Let's step back a little bit to consider the broader picture here. We have a question about what the world is like --- which of several conceivable hypotheses is true. Some hypotheses are ruled out on a priori grounds, others because they are incompatible with evidence, but that still leaves more than one admissible hypothesis, and the evidence we have does not conclusively favor any of them. Nonetheless, we must choose one hypothesis for purposes of action; at the very least we will act as though one of them is true. But we may err just as much through rejecting a truth as through accepting a falsehood. The two errors are symmetric, but they are not the same error. In this situation, we are advised to pick a hypothesis based, in part, on which error has graver consequences.

This is precisely the set-up of William James's "The Will to Believe". (It's easily accessible online, as are summaries and interpretations; for instance, an application to current controversies by Jessa Crispin.) In particular, James lays great stress on the fact that what statisticians now call Type I and Type II errors are both errors:

There are two ways of looking at our duty in the matter of opinion, — ways entirely different, and yet ways about whose difference the theory of knowledge seems hitherto to have shown very little concern. We must know the truth; and we must avoid error, — these are our first and great commandments as would-be knowers; but they are not two ways of stating an identical commandment, they are two separable laws. Although it may indeed happen that when we believe the truth A, we escape as an incidental consequence from believing the falsehood B, it hardly ever happens that by merely disbelieving B we necessarily believe A. We may in escaping B fall into believing other falsehoods, C or D, just as bad as B; or we may escape B by not believing anything at all, not even A.

Believe truth! Shun error! --- these, we see, are two materially different laws; and by choosing between them we may end by coloring differently our whole intellectual life. We may regard the chase for truth as paramount, and the avoidance of error as secondary; or we may, on the other hand, treat the avoidance of error as more imperative, and let truth take its chance. Clifford ... exhorts us to the latter course. Believe nothing, he tells us, keep your mind in suspense forever, rather than by closing it on insufficient evidence incur the awful risk of believing lies. You, on the other hand, may think that the risk of being in error is a very small matter when compared with the blessings of real knowledge, and be ready to be duped many times in your investigation rather than postpone indefinitely the chance of guessing true. I myself find it impossible to go with Clifford. We must remember that these feelings of our duty about either truth or error are in any case only expressions of our passional life. Biologically considered, our minds are as ready to grind out falsehood as veracity, and he who says, "Better go without belief forever than believe a lie!" merely shows his own preponderant private horror of becoming a dupe. He may be critical of many of his desires and fears, but this fear he slavishly obeys. He cannot imagine any one questioning its binding force. For my own part, I have also a horror of being duped; but I can believe that worse things than being duped may happen to a man in this world: so Clifford's exhortation has to my ears a thoroughly fantastic sound. It is like a general informing his soldiers that it is better to keep out of battle forever than to risk a single wound. Not so are victories either over enemies or over nature gained. Our errors are surely not such awfully solemn things. In a world where we are so certain to incur them in spite of all our caution, a certain lightness of heart seems healthier than this excessive nervousness on their behalf. At any rate, it seems the fittest thing for the empiricist philosopher.

From here the path to James's will to believe is pretty clear, at least in the form he advocated it, which is that of picking among hypotheses which are all "live"*, and where some choice must be made among them. What I am interested in, however, is not the use James made of this distinction, but simply the fact that he made it.

So far as I have been able to learn, no one drew this distinction between seeking truth and avoiding error before James, or if they did, they didn't make anything of it. (Even for Pascal in his wager, the idea that believing in Catholicism if it is false might be bad doesn't register.) Yet this is just what Neyman and Pearson were getting at, thirty-odd years later. There is no mention of James in these papers, or indeed of any other source. They present the distinction as though it were obvious, though eight decades of subsequent teaching experience shows it is anything but. Neyman and Pearson were very interested in the foundations of statistics, but seem to have paid no attention to earlier philosophers, except for the arguable case of Pearson's father Karl and his Grammar of Science (which does not seem to mention James). Yet there it is. It really looks like two independent inventions of the whole scheme for judging hypotheses.

My prejudices being what they are, I am much less inclined to think that James illuminates Neyman and Pearson than the other way around. James was, so to speak, arguing that we should trade significance — the risk of mistaking noise for signal — for power, finding some meaningful signal in what he elsewhere called the "blind molecular chaos" of the physical universe. Granting that there is a trade-off here, however, one has to wonder about how stark it really is (cf.), and whether his will-to-believe is really the best way to handle it. Neyman and Pearson suggest we should look for a procedure for resolving metaphysical questions which maximizes the ability to detect larger meanings for a given risk of seeing faces in clouds — and would let James and Clifford set their tolerance for that risk to their own satisfaction. Of course, any such procedure would have to squarely confront the fact that there may be no way of maximizing power against multiple alternatives simultaneously...

The extension to confidence sets, consisting of all hypotheses not rejected by suitably powerful tests (per Neyman 1937) is left as an exercise to the reader.

*: As an example of a "dead" hypothesis, James gives believing in "the Mahdi", presumably Muhammad Ahmad ibn as-Sayyid Abd Allah. I'm not a Muslim, and those of my ancestors who were certainly weren't Mahdists, but this was still a "What do you mean 'we', white man?" moment in my first reading of the essay. To be fair, James gives me many fewer such moments than most of his contemporaries.

Manual trackback: Brad DeLong; Robo; paperpools (I am not worthy!)

Enigmas of Chance; Philosophy; Modest Proposals

Posted by crshalizi at December 28, 2009 00:08 | permanent link

December 08, 2009

Uniform Probability on Infinite Spaces Considered Harmful

Attention conservation notice: 1000 words on a short probability problem. Several hobby-horses get to take turns around the yard.

Wolfgang at Mostly Harmless poses a problem (I've lightly tweaked the notation):

Consider a random process X(t) which generates a series of 0s and 1s, but many more 0s because the probability for X(t) = 1 decreases with t as 2^{-t}.

Now assume that we encounter this process not knowing 'how far we are already', in other words we don't know the value of t. The question is: "What is the probability to get a 1?"

Unfortunately there are two ways to answer this question. The first calculates the 'expectation value', as a physicist would call it, or 'the mean' as a statistician would put it, which is zero.

In other words, we sum over all possible t with equal weight and have to consider s = sum( 2^{-t} ) with t = 1, 2, ... N; It is not difficult to see that s = 1/2 + 1/4 + ... equals 1.

The answer is therefore Pr(X=1) = s/N = 1/N and because N is infinite (the process never ends) we get Pr(X=1) = 0.

The second answer simply looks at the definition of the process and points out that Pr(X=1) = 2^{-T}, where T is the current value of t. Although we don't know T it must have some finite value and it is obvious that Pr(X=1) > 0.

So which one is it, Pr(X=1) = 0 or Pr(X=1) > 0?

Fatwa: The second answer is correct, and the first is wrong.

Discussion: This is a cute example of the weirdness which results when we attempt to put uniform distributions on infinite spaces, even in the simplest possible case of the positive integers. The first way of proceeding assumes that the notion of a uniform probability distribution on the natural numbers makes sense, and that it obeys the same rules as an ordinary probability distribution. Unfortunately, these two requirements are incompatible. This is because ordinary probability distributions are countably additive. We are all familiar with the fact that probability adds across disjoint events: Pr(X= 0 or 1) = Pr(X=0)+Pr(X=1). Moreover, we are all comfortable with the idea that this holds for more than two events. The probability that X first =1 by time 3 is the sum of the probability of the first 1 being at time t=1, plus it being at t=2, plus it being at t=3. Carrying this out to any finite collection of disjoint events is called finite additivity. However, as I said, probability measures are ordinarily required to be countably additive, meaning that this holds even for a countable infinity of disjoint events.

And here we have trouble. The natural numbers are (by definition!) countable, so the probability of all integers is the sum of the probability of each integer,

Pr(T an integer) = sum(Pr(T=t))

The left-hand side must be 1. For a uniform distribution, we expect that all the terms in the sum on the right-hand side must be equal, otherwise it's not "uniform". But either all the terms are equal and positive, in which case the right-hand side is infinite, or all the terms are equal and zero, in which case the right-hand side is zero. Hence, there is no countably-additive uniform probability measure on the integers, and the first approach, which leads to the conclusion that Pr(X(T)=1)=0, is mathematically incoherent.
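Written out as a single display, with c standing for the common probability of each integer (the symbol c is introduced here only for illustration):

```latex
\Pr(T \in \mathbb{N})
  = \sum_{t=1}^{\infty} \Pr(T = t)
  = \sum_{t=1}^{\infty} c
  = \begin{cases}
      \infty & \text{if } c > 0, \\
      0      & \text{if } c = 0,
    \end{cases}
\qquad \text{so it can never equal } 1 .
```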

Now, there are such things as finitely-additive probability measures, but they are rather weird beasts. To specify one of them on the integers, for example, it's not enough to give the probability of each integer (as it is for a countably-additive measure); that only pins down the probability of finite sets, and sets whose complements are finite. It does not, for example, specify the probability of the even numbers. There turn out to be several different ways of defining uniform distributions on the natural numbers, which are not equivalent. Under all of them, however, any finite set must have probability zero, and so at a random time T, it is almost certain that Pr(X(T)=1) is less than any real number you care to name. Hence, the expectation value of this random probability is indeed zero.

(Notice, however, that if I try to calculate the expectation value of any function f(t) by taking a probability-weighted sum over values of t, as the first answer does, I will get the answer 0 when T follows a uniform finitely-additive measure, even if f(t)=1 for all t. The weighted-sum-of-arguments definition of expectation — the one reminiscent of Riemann integrals — does not work for these measures. Instead one must use a Lebesgue-style definition, where one takes a weighted sum of the values of the function, the weights being the measures of the sets giving those values. [More exactly, one partitions the range of f and takes the limit as the partition becomes finer and finer.] The equivalence of the summing over the domain and summing over the range turns on, precisely, countable additivity. The argument in the previous paragraph shows that here this expectation value must be less than any positive number, yet not negative, hence zero.)

Finitely-additive probability measures are profoundly weird beasts, though some of my colleagues have what I can only consider a perverse affection for them. On the other hand, attempts to construct a natural countably-additive analog of a uniform distribution on infinite sets have been universally unsuccessful; this very much includes the maximum entropy approach. The project of giving ignorance a unique representation as a probability measure is, IMSAO, a failure. If one picks some countably-additive prior distribution over the integers, however, then at least one value of t must have strictly positive probability, and the expectation value of Pr(X(T)=1) is positive, though how big it is will depend on the prior distribution. (As usual, the role of a Bayesian prior distribution is to introduce bias so as to reduce variance.) Alternately, one simply follows the second line of reasoning and concludes that, no matter what t might be, the probability is positive.
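The difference between the two lines of reasoning can be seen numerically. The sketch below is an illustration under assumptions of mine: the "uniform" answer is approximated by averaging over t = 1, ..., N for finite N, and the countably-additive prior is taken to be geometric, Pr(T=t) = (1-r) r^(t-1), which is just one convenient choice among many. The function names are hypothetical.

```python
from fractions import Fraction


def prob_one(t):
    """Pr(X(t) = 1) = 2^(-t), as in the problem statement."""
    return Fraction(1, 2 ** t)


def uniform_average(n_max):
    """Average of Pr(X(t) = 1) with equal weight on t = 1, ..., n_max."""
    return sum(prob_one(t) for t in range(1, n_max + 1)) / n_max


def geometric_prior_average(r, terms=200):
    """Expected Pr(X(T) = 1) under the prior Pr(T = t) = (1 - r) r^(t - 1).

    The infinite sum is truncated at `terms`; the neglected tail is
    smaller than (r / 2) ** terms.
    """
    return sum((1 - r) * r ** (t - 1) * prob_one(t) for t in range(1, terms + 1))


# Equal weights on 1..N: the average shrinks like 1/N as N grows.
for n_max in (10, 100, 1000):
    print(n_max, float(uniform_average(n_max)))

# A proper prior on the integers: the expectation stays strictly positive.
r = Fraction(1, 2)
print("geometric prior:", float(geometric_prior_average(r)))
print("closed form (1 - r) / (2 - r):", float((1 - r) / (2 - r)))
```

The finite-N averages head to zero, while any proper prior gives a positive answer whose size depends on the prior; for the geometric prior with r = 1/2 it is 1/3.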

Enigmas of Chance

Posted by crshalizi at December 08, 2009 10:55 | permanent link

December 05, 2009

36-350, Data Mining: Course Materials (Fall 2009)

My lesson-plan having survived first contact with the enemy students, it's time to start posting the lecture handouts &c. This page will be updated as the semester goes on; the RSS feed for it should be here. The class homepage has more information.

  1. Introduction to the course (24 August) What is data mining? how is it used? where did it come from? Some themes.
  2. Information retrieval and similarity searching I (26 August) Finding the data you are looking for. Ideas we will avoid: meta-data and cataloging; meanings. Textual features. The bag-of-words representation; its vector form. Measuring similarity and distance for vectors. Example with the New York Times Annotated Corpus.
  3. IR continued (28 August). The trick to searching: queries are documents. Search evaluation: precision, recall, precision-recall curves; error rates. Classification: nearest neighbors and prototypes; classifier evaluation by mis-classification rate and by confusion matrices. Inverse document frequency weighting. Visualizing high-dimensional data by multi-dimensional scaling. Miscellaneous topics: stemming, incorporating user feedback.

    Homework 1, due 4 September: assignment, R, data; solutions

  4. Page Rank (31 August). Links as pre-existing feedback. How to exploit link information? The random walk on the graph; using the ergodic theorem. Eigenvector formulation of page-rank. Combining page-rank with textual features. Other applications. Further reading on information retrieval.
  5. Image Search, Abstraction and Invariance (2 September). Similarity search for images. Back to representation design. The advantages of abstraction: simplification, recycling. The bag-of-colors representation. Examples. Invariants. Searching for images by searching text. An example in practice. Slides for this lecture.
  6. Information Theory I (4 September). Good features help us guess what we can't represent. Good features discriminate between different values of unobserved variables. Quantifying uncertainty with entropy. Quantifying reduction in uncertainty/ discrimination with mutual information. Ranking features based on mutual information. Examples, with code, of informative words for the Times. Code.
    Supplementary reading: David P. Feldman, Brief Tutorial on Information Theory, chapter 1

    Homework 2, due 11 September: assignment; solutions text and R code

  7. Information Theory II (9 September). Dealing with multiple features. Joint entropy, the chain rule for entropy. Information in multiple features. Conditional information, chain rule for information, conditional independence. Interactions, positive and negative, and redundancy. Greedy feature selection with low redundancy. Example, with code, of selecting words for the Times. Sufficient statistics and the information bottleneck. Code.
    Supplementary reading: Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", arxiv:cs.AI/0308002
  8. Categorization; Clustering I (11 September). Dividing the world up into categories. Classification: known categories with labeled examples. Taxonomy of learning problems (supervised, unsupervised, semi-supervised, feedback, ...). Clustering: discovering unknown categories from unlabeled data. Benefits of clustering, with a digression on where official classes come from. Basic criterion for good clusters: lots of information about features from little information about cluster. Practical considerations: compactness, separation, parsimony, balance. Doubts about parsimony and balance. The k-means clustering algorithm, or unlabeled prototype classification: analysis, geometry, search. Appendix: geometric aspects of the prototype and nearest-neighbor method.

    Homework 3, due 18 September: assignment; solutions

  9. Clustering II (14 September). Distances between partitions; variation-of-information distance. Hierarchical clustering by agglomeration and its varieties. Picking the number of clusters by merging costs. Performance of different clustering methods on various doodles. Why we would like to pick the number of clusters by predictive performance, and why it is hard to do at this stage. Reifying clusters.
  10. Transformations: Rescaling and Low-Dimensional Summaries (16 September). Improving on our original features. Re-scaling, standardization, taking logs, etc., of individual features. Forcing things to be Gaussian considered harmful. Low-dimensional summaries by combining features. Exploiting geometry to eliminate redundancy. Projections on to linear subspaces. Searching for structure-preserving projections.
  11. Principal Components I (18 September). Principal components are the directions of maximum variance. Derivation of principal components as the best approximation to the data in a linear subspace. Equivalence to variance maximization. Avoiding explicit optimization by finding eigenvalues and eigenvectors of the covariance matrix. Example of principal components with cars; how to tell a sports car from a minivan. The standard recipe for doing PCA. Cautions in interpreting PCA. Data-set used in the notes.

    Homework 4, due 25 September: assignment; solutions

  12. Principal Components II (21 September). PCA + information retrieval = latent semantic indexing; why LSI is a Good Idea. PCA and multidimensional scaling.
  13. Factor Analysis (23 and 25 September). From PCA to factor analysis by adding noise. Roots of factor analysis in causal discovery: Spearman's general factor model and the tetrad equations. Problems with estimating factor models: number of equations does not equal number of unknowns. Solution 1, "principal factors", a.k.a. estimation through heroic feats of linear algebra. Solution 2, maximum likelihood, a.k.a. estimation through imposing distributional assumptions. The rotation problem: the factor model is unidentifiable; the number of factors may be meaningful, but the individual factors are not.
  14. The Truth about PCA and Factor Analysis (28 September). PCA is data reduction without any probabilistic assumptions about where the data came from. Picking number of components. Faking predictions from PCA. Factor analysis makes stronger, probabilistic assumptions, and delivers stronger, predictive conclusions --- which could be wrong. Using probabilistic assumptions and/or predictions to pick how many factors. Factor analysis as a first, toy instance of a graphical causal model. The rotation problem once more with feeling. Factor models and mixture models. Factor models and Thomson's sampling model: an outstanding fit to a model with a few factors is actually evidence of a huge number of badly measured latent variables. Final advice: it all depends, but if you can only do one, try PCA. R code for the Thomson sampling model.
  15. Nonlinear Dimensionality Reduction I: Locally Linear Embedding (5 October). Failure of PCA and all other linear methods for nonlinear structures in data; spirals, for example. Approximate success of linear methods on small parts of nonlinear structures. Manifolds: smoothly curved surfaces embedded in higher-dimensional Euclidean spaces. Every manifold looks like a linear subspace on a sufficiently small scale, so we should be able to patch together many small local linear approximations into a global manifold. Local linear embedding: approximate each vector in the data as a weighted linear combination of its k nearest neighbors, then find the low-dimensional vectors best reconstructed by these weights. Solving the optimization problems by linear algebra. Coding up LLE. A spiral rainbow. R.
  16. Nonlinear Dimensionality Reduction II: Diffusion Maps (9 October). Making a graph from the data; random walks on this graph. The diffusion operator, a.k.a. Laplacian. How the Laplacian encodes the shape of the data. Eigenvectors of the Laplacian as coordinates. Connection to page-rank. Advantages when data are not actually on a manifold. Example.

    Pre-midterm review (12 October): highlights of the course to date; no handout.
    MIDTERM (14 October): exam, solutions

    Homework 5, due 23 October: assignment; solutions

  17. Regression I: Basics. Guessing a real-valued random variable; why expectation values are mean-square optimal point forecasts. The regression function; why its estimation must involve assumptions beyond the data. The bias-variance decomposition and the bias-variance trade-off. First example of improving prediction by introducing variance. Ordinary least squares linear regression as smoothing. Other linear smoothers: k-nearest-neighbors and kernel regression. How much should we smooth? R, data for running example
  18. Regression II: The Truth About Linear Regression (21 October). Linear regression is optimal linear (mean-square) prediction; we do this because we hope a linear approximation will work well enough over a small range. What linear regression does: decorrelate the input features, then correlate them separately with the response and add up. The extreme weakness of the probabilistic assumptions needed for this to make sense. Difficulties of linear regression; collinearity, errors in variables, shifting distributions of inputs, omitted variables. The usual extra probabilistic assumptions and their implications. Why you should always look at residuals. Why you generally shouldn't use regression for causal inference. How to torment angels. Likelihood-ratio tests for restrictions of nice models.
  19. Regression III: Extending Linear Regression (23 October). Weighted least squares. Heteroskedasticity: variance is not the same everywhere. Going to consult the oracle. Weighted least squares as a solution to heteroskedasticity. Nonparametric estimation of the variance function. Local polynomial regression: local constants (= kernel regression), local linear regression, higher-order local polynomials. Lowess = locally-linear smoothing for scatter plots. The oracles fall silent.

    Homework 6, due Friday, 30 October: assignment, data set; solutions

  20. Evaluating Predictive Models (26 and 28 October). In-sample, out-of-sample and generalization loss or error; risk as expected loss on new data. Under-fitting, over-fitting, and examples with polynomials. Methods of model selection and controlling over-fitting: empirical risk minimization, penalization, constraints/sieves, formal learning theory, cross-validation. Limits of generalization. R for creating figures.
  21. Smoothing Methods in Regression (30 October). How much smoothing should we do? Approximation by local averaging. How much smoothing we should do to find the unknown curve depends on how smooth the curve really is, which is unknown. Adaptation as a partial substitute for actual knowledge. Cross-validation for adapting to unknown smoothness. Application: testing parametric regression models by comparing them to nonparametric fits. The bootstrap principle. Why ever bother with parametric regressions? R code for some of the examples.

    Homework 7, due Friday, 6 November: assignment; solutions: text and code

  22. Additive Models (2 November). A nice feature of linear models: partial responses, partial residuals, and backfitting estimations. Additive models: regression curve is a sum of partial response functions; partial residuals and the backfitting trick generalize. Parametric and non-parametric rates of convergence. The curse of dimensionality for unstructured nonparametric models. Additive models as a compromise, introducing bias to reduce variance. Example with the data from homework 6.
  23. Classification and Regression Trees (4 and 6 November). Prediction trees. A classification tree we can believe in. Prediction trees combine simple local models with recursive partitioning; adaptive nearest neighbors. Regression trees: example; a little math; pruning by cross-validation; more R mechanics. Classification trees: basics; measuring error by mis-classification; weighted errors; likelihood; Neyman-Pearson classifiers. Uncertainty for trees.

    Homework 8, due 5 pm on Monday, 16 November: assignment; solutions; R for solutions

  24. Combining Models 1: Bagging and Model Averaging (9 November)
  25. Combining Models 2: Diversity and Boosting (11 November)
  26. Linear Classifiers (16 November). Geometry of linear classifiers. The perceptron algorithm for learning linear classifiers. The idea of "margin".
  27. Logistic Regression (18 November). Attaching probabilities to linear classifiers: why would we want to? why would we use the logistic transform to do so? More-than-binary logistic regression. Maximizing the likelihood; Newton's method for optimization. Generalized linear models and generalized additive models; testing GLM specifications with GAMs.
  28. Support Vector Machines (20 November). Turning nonlinear problems into linear ones by expanding into high-dimensional feature spaces. The dual representation of linear classifiers: weight training points, not features. Observation: in the dual representation, only inner products of vectors matter. The kernel trick: kernel functions let us compute inner products in feature spaces without computing the features. Some bounds on the generalization error of linear classifiers based on "margin" and the number of training points with non-zero weight ("support vectors"). Learning support vector machines by trading in-sample performance against bounds on over-fitting.

    Homework 9, due at 5 pm on Monday, 30 November: assignment

  29. Density Estimation (23 November). Histograms as distribution estimates. Glivenko-Cantelli, "the fundamental theorem of statistics". Histograms as density estimates; selecting density estimates by cross-validation. Kernel density estimates. Why kernels are better than histograms. Curse of dimensionality again. Hint at alternatives to kernel density estimates.
  30. Mixture Models, Latent Variables and the EM Algorithm (30 November). Compressing and restricting density estimates. Mixtures of limited numbers of distributions. Mixture models as probabilistic clustering; finally an answer to "how many clusters?" The EM algorithm as an iterative way of maximizing likelihood with latent variables. Analogy to k-means. More theory of the EM algorithm. Applications: density mixtures, signal processing/state estimation, mixtures of regressions, mixtures of experts; topic models and probabilistic latent semantic analysis. A glance at non-parametric mixture models.
  31. Graphical Causal Models (2 December). Distinction between causation and association, and between causal and probabilistic prediction. Some examples. Directed acyclic graphs and causal models. The Markov property. Conditional independence via separation. Faithfulness.
  32. Causal Inference (4 December). Estimating causal effects; control for confounding. Discovering causal structure: the SGS algorithm and its variants. Limitations.

    Take-home final exam, due 15 December: assignment; data sets: expressdb_cleaned (20 Mb), HuIyer_TFKO_expression (20 Mb). With great thanks to Dr. Timothy Danford.
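To make one item from the schedule concrete, here is a minimal sketch of the k-means algorithm from lecture 8, in plain Python rather than the R used in the course. The function name `kmeans`, the choice of the first k points as initial centers, and the toy data are all assumptions made for illustration; nothing here is taken from the course handouts.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update.

    points: list of equal-length tuples of floats.
    Initial centers are simply the first k points (an arbitrary,
    illustrative choice; real code would use several random starts).
    """
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed, so stop
            break
        labels = new_labels
        # Update step: each center moves to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Two obvious clumps, near (0, 0) and near (5, 5).
data = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (1.0, 0.0), (5.0, 6.0), (6.0, 5.0)]
centers, labels = kmeans(data, k=2)
```

The two steps correspond to the "unlabeled prototype classification" reading of the algorithm: assign each point to the nearest prototype, then re-estimate each prototype as the mean of its points. The result depends on the starting centers, which is the "search" part of the lecture description.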

Corrupting the Young; Enigmas of Chance

Posted by crshalizi at December 05, 2009 14:39 | permanent link

November 30, 2009

Books to Read While the Algae Grow in Your Fur, November 2009

Jen Van Meter, Christine Norrie and Chynna Clugston-Major, Hopeless Savages
Incredibly sweet and charming; whether it's really punk rock I couldn't say. (I completely forget where I saw this recommended, but thanks to whoever it was.)
Mike Mignola and Christopher Golden, Baltimore, or, the Steadfast Tin Soldier and the Vampire
Stories within stories, framed by the Great War unleashing not the influenza pandemic of 1918, but a vampire-zombie apocalypse. Many, many nods to prior horror fiction (most obviously Dracula, but also "The Masque of the Red Death", etc.), and a lot of folkloric elements used to nicely creepy effect. (But isn't "Mircea" a masculine name?) Mignola's drawings are decorative and atmospheric, but not integral.
Cat Rambo and Jeff VanderMeer, The Surgeon's Tale, and Other Stories
The highlight is the title story, which occupies about half this little book, and breathes new life — you should forgive the expression — into the ancient trope of the Resurrection Gone Awry. Of the rest, Rambo's "The Dead Girl's Wedding March" and "A Key Decides Its Destiny" are the best, followed by VanderMeer's "The Farmer's Cat". About the last story, an extended joke about a Lovecraftian menu, the kindest thing I can say is that the authors must've had fun writing it.
F. T. Marinetti, The Untameables
Not actually recommended, unless you want a violent Futurist words-in-liberty fantasy full of orientalism, racism, and (most poisonously) formulaic decadence. On this evidence, Marinetti was much better at writing manifestoes (and cookbooks) than fiction. — I have had this on my shelf since, so help me, 1994, when I first started reading about Futurism; I should've gotten rid of it long ago.
Jason Aaron, R. M. Guéra, Davide Furnò and Francesco Francavilla, Scalped, vol. 5: High Lonesome
Noir blacker than coal-dust. Earlier installments: 1, 2--4.
Phil and Kaja Foglio, Agatha Heterodyne and the: Golden Trilobite; Voice of the Castle; Chapel of Bones
Vols. 6--8 of Girl Genius; in which the lost heir reclaims the ancestral castle, through the power of Science! (As well as perfecting the coffee-maker.)
Lev Vygotsky, Mind in Society: Development of Higher Psychological Processes
A fairly clear and cohesive statement of Vygotsky's key ideas, which were a species of pre-cognitive Marxist psychology. Here he is concerned with looking at what happens to children's cognitive development when they bring together their practical abilities to manipulate their bodies and tools, with their communicative abilities to use words (and other signs) — specifically their learning to use speech to guide behavior, especially their own behavior. (This is very much about the unity of theory and praxis; but Dewey said similar things, from a background of American pragmatism. [Then again, Marx himself was pretty pragmatist already in 1845.] Vygotsky mentions Dewey once here, without much understanding.) Specifically, Vygotsky claims that speech and discursive thought come to guide behavior through children learning to talk to themselves about what they need to do to solve practical tasks, which they come to from previously learning to talk to others about what to do, or trying to get others to do it for them.
This sets up three big themes of Vygotsky's. First, he thinks that all of the characteristically human ("higher") mental processes originate as social interactions, which we then learn to internalize and carry out independently. The Marxist themes (especially out of Engels) here are obvious; he does not, needless to say, demonstrate his contention, and in any case seems to overlook the point that an organism needs a lot of specialized structure and capacity to engage in those social interactions in the first place, let alone internalize them! But he deserves, I think, considerable credit for raising the problem, and the related one of how we use tools and signs in our environment to extend our own cognition. (Here some of the experiments he reports on what's needed for children to make effective use of memory aids are fascinating; but this approaches stigmergy rather than social labor.) Secondly, he emphasizes, when assessing children's development, that looking at what they can do on their own is just picking out (at best) what they have finished learning. If instead one assesses what they can do "with assistance, collaboratively or in groups", what he calls the "zone of proximal development", one gets a sense of what they are learning and could learn. One might argue, though he doesn't pursue this, that even in adults, activities like scientific investigation never really leave the zone of proximal development, especially not at the highest levels of accomplishment. (Skilled scientists can do their students' homework problems on their own, but not their research.) Thirdly, he is emphatic that if you want to investigate cognitive development, you need to probe the developmental process, and not just its end-products in over-learned habits or polished skills. 
He suggests that the ideal experiments would actually evoke cognitive development under laboratory conditions, and that his students carried out such experiments; he does not provide enough details to assess such a claim.
The book concludes with some chapters on play, imagination and make-believe, and on the "pre-history" of writing (i.e., things which come before writing in individual development but have some kinship to it).
Despite the way I've written, this isn't actually a book Vygotsky planned and wrote out. It was assembled by its translators out of distinct Russian manuscripts and preliminary translations provided by A. R. Luria, and then considerably revised by the translators. (I presume this is where anachronisms like "World War I" came from.) How much of the result is really due to Vygotsky, or to Vygotsky-and-Luria, and how much to the Americans, I can't say. The latter do however provide an introduction, a brief biographical note and an extensive afterword; all of these are probably most useful to readers previously unacquainted with Vygotsky or his school.
Michael Bérubé, The Left at War
The first half of this is Bérubé arguing with (as he says) the "Manichean Left" over Afghanistan, Iraq, the Balkan wars of the 1990s, and their general orientation and understanding of how the capitalist democracies work, or don't. I find myself in complete agreement with this, including Bérubé's positive vision, and unable to add anything of value. (Read it.)
The second half is an argument about the theory of ideology, the notion of hegemony, and the not-exactly-a-discipline of cultural studies. More specifically, it's a plea to get beyond the dualism of "everything with any shred of mass appeal is a tool of the System" on the one hand, and pretending that fans reading their own meaning into music videos (or whatever) has anything to do with smashing said system, on the other. His plea is for a more nuanced approach to ideology, which recognizes that the leading ideas of any society are never all of a block, that political power always comes from coalitions bringing together many divergent interests and ideas, etc., etc. He is particularly fond of a version of these ideas articulated by Stuart Hall, which seem, to judge by his quotations, quite reasonable but not at all special, unless one is starting out from the most benighted precincts of Marxism (e.g., that old fraud Althusser). Linked to this, Bérubé is quite strenuous about the importance of issues other than economic justice for any left that wants to be serious about making sure that everyone isn't just formally free, but can actually use and enjoy their freedom.
Again, I find it hard to disagree with most of this; I just fail to see what the two halves of the book have to do with each other. The best argument I can reconstruct on his behalf would be something like this: lots of people on the left have drifted, or been pushed, into Manichean positions because it seems to follow from the way they understand ideology. If a better account had been widely disseminated, fewer of them would have pursued that dead end. Some parts of cultural studies had articulated that better account, but they failed to make themselves heard; thus, if only more attention had been paid to debates about how best to make use of Gramsci to understand British politics in the early 1980s, the left would not have backed itself into such a corner in 2001--2003.
I find myself drifting into snark in that last sentence, which is unfair. I agree that the very crude counter-cultural thinking and functionalism which seem very common on the left are Not Helpful. I'm just skeptical that (1) giving more weight to cultural studies would have made this better, and that (2) the particular cultural studies sub-tradition Bérubé points to is really the best available way of thinking about these matters. I have my own favorite candidate, but mostly it's that what he takes from this sub-tradition just doesn't seem that distinctive, except for its Marxist background. (For instance, compare what Bérubé, following Hall, says about first asking what's right or true about an ideology with what Boudon said in his excellent book on The Analysis of Ideology [= L'origine des idées reçues]. And no, I'm not trying to play "My esoteric French theorist trumps your merely-obscure British theorist"; Boudon is a distinguished sociologist who just happens to have made this the center of his theory of ideology.) Which is not to say that we shouldn't work to develop and disseminate better ideas about these matters!
Bérubé tends to avoid advancing his case directly, but rather to get his point across by discussing some other writer's work, or in some cases some other writer's discussion of a third author, etc. (I suspect this is a professional deformation of literary critics.) This is a manner of writing which drives some people crazy, and one which I find very tiresome in other hands, but he pulls it off. Assuming this won't put you off, I recommend this very strongly if you have any interest at all in progressive politics.
Disclaimer: Bérubé blogs at Crooked Timber, where I've guest-blogged, etc., but we've never met or corresponded, and I have no stake in the success of his book.
Manual trackback: Michael Bérubé.
Rick Geary, Trotsky: A Graphic Biography
Well-told and well-drawn — though nothing in the rest of the art matches the level of the hero/monster panels of the opening pages (and cover!).
Sarah Graves, Unhinged; Mallets Aforethought; Tool and Die; Nail Biter; Trap Door
Honestly, I'm a bit surprised series fatigue hasn't set in yet; but I continue to enjoy these. (And Eastport continues to suffer levels of violent death comparable to post-invasion Iraq. The fact that this homicide rate has not yet attracted official attention suggests that the serial killer is Bob Arnold, the police chief, rather than Jake Tiptree or Ellie White.) Previous installments: 1--4, 5; sequel: 11.
Jonathan Israel, A Revolution of the Mind: Radical Enlightenment and the Intellectual Origins of Modern Democracy
The story, or part of the story, of how the outlandish and unprecedented ideology of a network of radical, subversive scribblers became what we all at least pay lip-service to. Really deserves a detailed discussion; I'll just say that there's a lot of fascinating material in here, but also many places where I felt he didn't really prove his point, even or especially when I was very sympathetic to what he was saying.
Stephen Budiansky, The Bloody Shirt: Terror After the Civil War
Once upon a time, the US Army attempted to bring democracy to a backward part of the world which had long been wracked by ethnic conflict. There were some promising beginnings, but the defeated, formerly dominant faction refused to accept their relative demotion, and engaged in a vicious, well-organized campaign of terrorism, which ultimately proved to be entirely successful. Those who had trusted enough in the power and benevolence of the United States to participate in the governments ultimately overthrown by "violence and fraud" (in the words of one of the over-throwers) were lucky to escape with their lives (as many did not). Minimal democratic norms were not re-established for ninety years or more.
This is of course the story of the failed Reconstruction of the South after the civil war, which Budiansky tells by recounting the inter-cut, and occasionally overlapping, lives of a number of individuals on the Reconstruction side of the conflict. One of his more effective tactics is to quote extensively from their letters and journals, as well as from contemporary books and newspapers. Caveat lector: many of these — especially the newspapers — are full of vicious racist bile, as well as the astonishing lies elite white Southerners told to portray themselves as oppressed victims. (This begins with the story of the "bloody shirt" that opens the book.) This stuff was hard for me to stomach, and might be too much for some.
My biggest complaint with the book is that I wish Budiansky had done more to tell the stories of black Americans, the way he did with his white subjects — not that there are none, I hasten to add. I can guess at reasons why it would be harder to find materials (all of them ultimately having to do with the fact that Southern blacks were an oppressed people who emerged from slavery for a few years before being crushed back down to serfdom), but still... That said, Budiansky's story of crushed hopes, futile bravery and murderous hatred is wonderfully written and incredibly depressing. I hope that it fills many Americans with the sort of patriotic shame which helps us be better.
Luc Devroye and Gabor Lugosi, Combinatorial Methods in Density Estimation
The fundamental theorem of statistics, says Pitman, is the Glivenko-Cantelli theorem: the empirical distribution function F_n of a large sample of independent, identically-distributed random variables comes arbitrarily close to their true distribution function F: as n goes to infinity, max_x |F_n(x) - F(x)| goes to 0 almost surely. This means that we can learn any underlying probability distribution to arbitrary accuracy just by collecting enough data.* Unfortunately the empirical distribution function is always discrete, so it doesn't have a density, even if the underlying distribution does. Or, if you like, it has a density, but it's a mixture of Dirac delta functions. (The convergence is in the sense of "weak convergence" or "convergence in distribution".) Density estimation is basically about taking the empirical distribution function and smoothing it so that it has a well-behaved density. The oldest way of doing this is to build a histogram, which gives constant densities to intervals; other methods include fitting function series (Fourier or wavelet expansions) to the data, or using kernels (replacing each of the delta function spikes with a smooth density, say a Gaussian bell-shaped curve). The art here is to pick the manner of smoothing, and the amount of smoothing, so that (1) the convergence promised by Glivenko-Cantelli for the unsmoothed distribution is not just maintained but is (2) strengthened to convergence of the estimated density on the true density, and ideally (3) the latter convergence happens rapidly.
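The Glivenko-Cantelli statement is easy to check numerically. The following is a minimal sketch, assuming Uniform(0,1) data so that the true distribution function is just F(x) = x; the helper name `sup_distance_uniform` and the sample sizes are illustrative choices, not anything from the book.

```python
import random

def sup_distance_uniform(sample):
    """Largest gap between the empirical CDF of `sample` and the
    Uniform(0,1) CDF, F(x) = x."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x, so the
        # largest gap must occur on one side of some jump.
        gap = max(gap, abs((i + 1) / n - x), abs(i / n - x))
    return gap

rng = random.Random(0)  # fixed seed, purely for reproducibility
gaps = {}
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    gaps[n] = sup_distance_uniform(sample)
    print(n, gaps[n])
```

The gap should shrink roughly like one over the square root of n, so going from 100 to 10,000 points should cut it by about a factor of ten. Note that the empirical distribution function is still a step function at every n, which is why smoothing is needed to get a density out of it.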
Devroye and Lugosi's book is devoted to establishing conditions under which common density estimators have these three desirable properties (or, more rarely, when they do not). Throughout, they focus on the "total variation" or L1 distance between densities: d_TV(f,g) is the integral of |f(x) - g(x)| over all x. They mention, but generally avoid, other common distances or pseudo-distances such as L2 (integral of |f(x) - g(x)|^2), Hellinger distance (too ugly to write in HTML), or relative entropy (Kullback-Leibler divergence, expected log-likelihood ratio). The total variation distance has a very natural probabilistic interpretation (the maximum amount by which the estimated probability of any event differs from its true probability), and they can get very nice finite-sample bounds by minimizing it over various classes of possible estimates, so this choice is eminently defensible; it does however cut them off from using a lot of existing theory. (For instance, the optimal coefficients in a Fourier series, from an L1 point of view, are not just the empirical Fourier coefficients, since the latter are L2 optimal.)
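The probabilistic reading of the L1 distance can be made exact. A short derivation (a standard identity usually credited to Scheffé, written here in notation chosen for illustration, with B the set of x where f(x) > g(x)) shows that the largest discrepancy in the probability of any event is half the integral of |f - g|:

```latex
\begin{align*}
\int \bigl(f(x) - g(x)\bigr)\,dx = 0
  \quad &\Longrightarrow \quad
  \int_B (f - g)\,dx = \int_{B^c} (g - f)\,dx
  = \frac{1}{2}\int |f(x) - g(x)|\,dx, \\
\sup_{A} \left| \int_A f(x)\,dx - \int_A g(x)\,dx \right|
  &= \int_B \bigl(f(x) - g(x)\bigr)\,dx
   = \frac{1}{2}\int |f(x) - g(x)|\,dx .
\end{align*}
```

The first line holds because both densities integrate to one. The second holds because, for any event A, the integral of f - g over A can only grow if we throw away the part of A where f - g is negative and add in the rest of B (and symmetrically with the roles of f and g swapped). So a bound on the L1 error is, up to the factor of two, a bound on the error in every event's probability at once.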
Their general goal is to prove finite-sample upper bounds on the L1 error of their density estimates; if these go to zero as n grows, we get (1) and (2) above, and the rate of convergence tells us how close we are to obtaining (3). Their route to this goal is almost always through VC theory, and empirical process theory more generally. As always, this has two parts: one is deviation inequalities (e.g., Hoeffding's) which bound the probability that any one candidate density will look much better in sample than it will look out of sample. The other part is combinatorial arguments that the behavior of an entire space of functions can be approximated by that of a finite number of key functions. Meshed together by a union bound, these give uniform concentration bounds, with rates of convergence depending on the complexity of the combinatorial construction needed to achieve a given degree of approximation (i.e., the VC dimension). Devroye and Lugosi's key theorems bound the error of their density estimates in terms of the VC dimensions of the sets formed by comparing two densities in the class. (Specifically, they are interested in the sets where one estimate is higher than another by a given amount; this is, as they note, extremely similar to the threshold procedure used to apply VC theory to regression problems.) Finite VC dimension for such sets implies convergence to within a constant factor of the best available approximation to the true density. They extend such results to ones where the amount of smoothing is determined by data-set splitting, i.e., dividing the data into a training and a testing set, and picking the degree of smoothing which best generalizes from the training set to the testing set. (They do not consider any other form of cross-validation, which is a shame because they're a lot more common than simple data-splitting, but understandable because they're very ugly to analyze.) 
They give a lot of attention to kernel density estimates, including bounds for continuous kernels in terms of how hard it is to approximate them by simple step-functions, for which the combinatorics are easy.
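The data-splitting idea is simple to sketch for a Gaussian kernel density estimate. One loud caveat: the sketch below scores each bandwidth by average held-out log-likelihood, which is a common and convenient criterion but is not the L1-based selection rule Devroye and Lugosi analyze. The function names, the candidate bandwidths, and the simulated standard-Gaussian data are all assumptions made for illustration.

```python
import math
import random

def gaussian_kde(train, h):
    """Return a function x -> kernel density estimate at x, putting a
    Gaussian bump of bandwidth h on each training point."""
    norm = 1.0 / (len(train) * h * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - t) / h) ** 2) for t in train)
    return density

def held_out_log_likelihood(train, test, h):
    """Average log density of the test points under the KDE fit to train."""
    density = gaussian_kde(train, h)
    # Floor the density so a tiny bandwidth cannot produce log(0).
    return sum(math.log(max(density(x), 1e-300)) for x in test) / len(test)

rng = random.Random(1)  # fixed seed, purely for reproducibility
data = [rng.gauss(0.0, 1.0) for _ in range(400)]
train, test = data[:200], data[200:]  # simple split: half and half

bandwidths = [0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
scores = {h: held_out_log_likelihood(train, test, h) for h in bandwidths}
best_h = max(scores, key=scores.get)
print(best_h)
```

Too small a bandwidth leaves the estimate as a row of spikes that assigns almost no density to new points; too large a bandwidth smears everything into one wide bump. The held-out score penalizes both, which is the sense in which the split "picks the degree of smoothing which best generalizes".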
Strictly speaking, the book presupposes measure-theoretic probability, but readers uncomfortable with sigma-fields and Radon-Nikodym derivatives could mostly get away with ignoring the former and reading "probability density functions" for the latter. Similarly, the actual combinatorics are either elementary, or can be taken on trust. This book is probably not the best way to first encounter density estimation — I suspect a less theoretical introduction would not only make the ideas clearer, but also make readers want theoretical guidance — but no experience on that score is, strictly, necessary. Neither, really, is prior knowledge of learning theory or VC theory, though again it would probably help. The ideal situation for the book is, I'd guess, a second-year graduate-level course on density estimation (there are many excellent problems), or self-study.
*: Well, we have to pretend the data are IID, but let that slide. Or: assume sufficiently rapid strong mixing and argue, as in Vidyasagar, that VC results then hold with tolerable corrections. Kernel density estimates for stochastic processes are treated at length in Bosq's Nonparametric Statistics for Stochastic Processes: Estimation and Prediction, but the starting point there is ergodic theory, not learning theory.
George Clark, Science and Social Welfare in the Age of Newton
Connections between the scientific revolution, economic development and economic policy (such as it was) in late-17th and early-18th century England, and to a lesser extent France and the Netherlands. Interesting stuff on the connections between the activities of scientists and technological development, including the shrewd observation, contra Marxists claiming that scientific progress was basically directed to solving the capitalists' problems, that there were plenty of lucrative problems where scientists got nowhere, or didn't even try to get anywhere, because it was just not scientifically feasible. Also some interesting material on the early history of statistics. The first edition was published in 1937, and shows both that it was written during the Depression, and that respectable economists had no idea what was going on. (This does not much harm the book.)
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; The Progressive Forces; The Great Transformation; The Beloved Republic; Minds, Brains, and Neurons; Enigmas of Chance; The Continuing Crises; Cthulhiana

Posted by crshalizi at November 30, 2009 23:59 | permanent link

November 24, 2009

(I Don't Give a Damn About My) Bad Reputation

Attention conservation notice: Unskillful nattering about pop-culture ephemera.
For the sake of my own sanity, I prefer to remain ignorant of the occult processes by which the direct mail gods decide which catalogues to send to which people. (There's too much dynamic programming involved.) Today, for instance, they decided to inflict upon me the official Barbie doll spring 2010 collection preview, and like a fool I couldn't resist looking through it. Thus my life is made that much worse by learning that there is a Joan Jett Barbie doll. (I thought about embedding an image, but in this case pain shared is not pain eased.) I think I finally grasp what people mean when they talk about later cultural products assaulting parts of their childhood, in this case one I didn't even realize I valued.

Manual trackback: Mostly Harmless

Linkage

Posted by crshalizi at November 24, 2009 19:27 | permanent link

November 21, 2009

"Homophily, Contagion, Confounding: Pick Any Three"

A number of people have asked for my slides from the MERSIH conference the other week. So, here they are. (Anyone who was at my talk at SFI about a year ago will recognize the title, and much of the content.) I'm presently turning this into a proper manuscript, so comments are welcome. Please don't rip it off; I'll become very cross and may even hold my breath until I turn blue and pass out, and won't you be sorry then?

Manual trackback: Cognition and Culture

Networks; Enigmas of Chance; Complexity; Self-Centered

Posted by crshalizi at November 21, 2009 18:32 | permanent link

November 19, 2009

"Statistical Analysis of Stellar Evolution" (Next Week at the Statistics Seminar)

In which the starry heavens above submit to statistical analysis:

David van Dyk, "Statistical Analysis of Stellar Evolution"
Abstract: Color-Magnitude Diagrams (CMDs) are plots that compare the magnitudes (luminosities) of stars in different wavelengths of light (colors). High non-linear correlations among the mass, color and surface temperature of newly formed stars induce a long narrow curved point cloud in a CMD known as the main sequence. Aging stars form new CMD groups of red giants and white dwarfs. The physical processes that govern this evolution can be described with mathematical models and explored using complex computer models. These calculations are designed to predict the plotted magnitudes as a function of parameters of scientific interest such as stellar age, mass, and metallicity. Here, we describe how we use the computer models as a component of a complex likelihood function in a Bayesian analysis that requires sophisticated computing, corrects for contamination of the data by field stars, accounts for complications caused by unresolved binary-star systems, and aims to compare competing physics-based computer models of stellar evolution.
This is joint work with Steven DeGennaro, Nathan Stein, William H. Jefferys, Ted von Hippel, and Elizabeth Jeffery.
Place and time: Doherty Hall A310, Monday, 23 November, 4--5 pm.

Enigmas of Chance; The Eternal Silence of These Infinite Spaces; Physics

Posted by crshalizi at November 19, 2009 12:02 | permanent link

November 13, 2009

"Some Things Statisticians Do at Google" (Next Week at the Statistics Seminar)

Attention conservation notice: Of no use to you unless (1) you want to know what statisticians do at search-engine companies and (2) you are in Pittsburgh.
Mike Meyer, "Some Things Statisticians Do at Google"
Abstract: I'll talk about a number of projects at Google where statisticians have made a large contribution. There will not be a lot of technical details. In some cases I will just describe the problem.
The major example will be a description of the statistical and engineering infrastructure to support live traffic experiments at Google.
A common theme of the problems is the importance of understanding basic statistical principles that can be applied and modified to handle new data and new circumstances.
Place and time: Monday, 16 November at 4 pm, in Doherty Hall A310

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at November 13, 2009 15:09 | permanent link

November 08, 2009

The Shadow Price of Power

Attention conservation notice: Quasi-teaching note giving an economic interpretation of the Neyman-Pearson lemma on statistical hypothesis testing.

Suppose we want to pick out some sort of signal from a background of noise. As every schoolchild knows, any procedure for doing this, or test, divides the data space into two parts, the one where it says "noise" and the one where it says "signal".* Tests will make two kinds of mistakes: they can take noise to be signal, a false alarm, or can ignore a genuine signal as noise, a miss. Both the signal and the noise are stochastic, or we can treat them as such anyway. (Any determinism distinguishable from chance is just insufficiently complicated.) We want tests where the probabilities of both types of errors are small. The probability of a false alarm is called the size of the test; it is the measure of the "say 'signal'" region under the noise distribution. The probability of a miss, as opposed to a false alarm, has no short name in the jargon, but one minus the probability of a miss (the probability of detecting a signal when it's present) is called power.

Suppose we know that the probability density of the noise is p and that of the signal is q. The Neyman-Pearson lemma, as many though not all schoolchildren know, says that then, among all tests of a given size s, the one with the smallest miss probability, or highest power, has the form "say 'signal' if q(x)/p(x) > t(s), otherwise say 'noise'," and that the threshold t varies inversely with s. The quantity q(x)/p(x) is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if it's sufficiently more likely than noise.
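To make the size/threshold/power relationship concrete, here is a minimal sketch in Python. The two densities are assumptions picked purely for illustration (noise is standard normal, signal is normal with mean 1 and the same variance); nothing above depends on that choice. With these densities the likelihood ratio is increasing in x, so "q(x)/p(x) > t" is the same as "x > c" for a cutoff c fixed by the size.

```python
from statistics import NormalDist

# Assumed densities, for illustration only: p = noise, q = signal.
noise = NormalDist(mu=0.0, sigma=1.0)
signal = NormalDist(mu=1.0, sigma=1.0)


def likelihood_ratio(x):
    """q(x)/p(x) at the data point x."""
    return signal.pdf(x) / noise.pdf(x)


def lr_test(s):
    """Cutoff c, threshold t(s), and power of the size-s likelihood ratio test.

    Relies on the ratio being increasing in x for these two densities,
    so the region "say 'signal'" is simply {x > c}.
    """
    c = noise.inv_cdf(1.0 - s)      # false-alarm probability P(x > c) = s
    t = likelihood_ratio(c)         # ratio on the boundary of the region
    power = 1.0 - signal.cdf(c)     # Q(x > c), one minus the miss probability
    return c, t, power


if __name__ == "__main__":
    for s in (0.01, 0.05, 0.20):
        c, t, power = lr_test(s)
        print(f"size={s:.2f}  cutoff={c:.3f}  threshold={t:.3f}  power={power:.3f}")
```

Running it for a few sizes shows the threshold falling and the power rising as s grows, which is the "t varies inversely with s" part of the lemma.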

The likelihood ratio indicates how different the two distributions — the two hypotheses — are at x, the data-point we observed. It makes sense that the outcome of the hypothesis test should depend on this sort of discrepancy between the hypotheses. But why the ratio, rather than, say, the difference q(x) - p(x), or a signed squared difference, etc.? Can we make this intuitive?

Start with the fact that we have an optimization problem under a constraint. Call the region where we proclaim "signal" R. We want to maximize its probability when we are seeing a signal, Q(R), while constraining the false-alarm probability, P(R) = s. Lagrange tells us that the way to do this is to maximize Q(R) - t[P(R) - s] over R and t jointly. So far the usual story; the next turn is usually "as you remember from the calculus of variations..."

Rather than actually doing math, let's think like economists. Picking the set R gives us a certain benefit, in the form of the power Q(R), and a cost, tP(R). (The ts term is the same for all R.) Economists, of course, tell us to equate marginal costs and benefits. What is the marginal benefit of expanding R to include a small neighborhood around the point x? Just, by the definition of "probability density", q(x). The marginal cost is likewise tp(x). We should include x in R if q(x) > tp(x), or q(x)/p(x) > t. The boundary of R is where marginal benefit equals marginal cost, and that is why we need the likelihood ratio and not the likelihood difference, or anything else. (Except for a monotone transformation of the ratio, e.g. the log ratio.) The likelihood ratio threshold t is, in fact, the shadow price of statistical power.
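The marginal argument can be checked by brute force on a small finite sample space, where "density" just means probability mass. The sketch below is only an illustration: the six-point distributions p and q are made up, and exact fractions are used so that the size constraint P(R) = s can be tested with equality.

```python
from fractions import Fraction as F
from itertools import combinations

# Made-up distributions on the points 0..5 (both sum to 1).
p = [F(1, 4), F(1, 4), F(1, 8), F(1, 8), F(1, 8), F(1, 8)]   # noise
q = [F(1, 16), F(1, 8), F(1, 16), F(1, 8), F(1, 4), F(3, 8)]  # signal


def measure(dist, region):
    """Probability of a region (any iterable of points) under dist."""
    return sum((dist[x] for x in region), F(0))


def lr_region(t):
    """Include x exactly when marginal benefit q(x) exceeds marginal cost t*p(x)."""
    return {x for x in range(len(p)) if q[x] > t * p[x]}


def best_region_brute_force(s):
    """Search every subset with P(R) == s and return one with the largest Q(R)."""
    best = None
    for k in range(len(p) + 1):
        for region in combinations(range(len(p)), k):
            if measure(p, region) != s:
                continue
            if best is None or measure(q, region) > measure(q, best):
                best = set(region)
    return best


if __name__ == "__main__":
    R = lr_region(F(1))
    s = measure(p, R)
    print("LR region:", sorted(R), "size:", s, "power:", measure(q, R))
    print("brute force:", sorted(best_region_brute_force(s)))
```

At price t = 1 the rule picks the two points whose ratio exceeds 1, and the exhaustive search over all regions of the same size finds no region with more power.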

I am pretty sure I have not seen or heard the Neyman-Pearson lemma explained marginally before, but in retrospect it seems too simple to be new, so pointers would be appreciated.

Manual trackback: John Barrdear

Updates: Thanks to David Kane for spotting a typo.

*: Yes, you could have a randomized test procedure, but the situations where those actually help pretty much define "boring, merely-technical complications."

Enigmas of Chance

Posted by crshalizi at November 08, 2009 03:06 | permanent link

November 04, 2009

Blosxom Fading in November

My old Blosxom installation (v. 2.0.2), after several years of working nicely, is growing increasingly cranky, and mulishly refusing to generate or update posts as the whim takes it. (I am not sure how much kicking and shoving it will need to produce this.) I'd appreciate a pointer to something which works similarly, but does work: I write posts in plain HTML in Emacs and drop them in a directory; it makes them look nice. If it handles tags and/or LaTeX nicely, so much the better.

Self-Centered

Posted by crshalizi at November 04, 2009 19:34 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems