Categories
Self-Centered
Books to Read While the Algae Grow in Your Fur
Books I've read in the last month or so and
feel I can recommend (warning: I have no taste)
- Dexter 3
- Few things are quite so restorative when facing the winter blahs as a
well-made TV show that understands the true meaning and importance of
friendship and family ties.
- Marshall G. S. Hodgson, Rethinking World History: Essays on
Europe, Islam and World History
- Hodgson was a historian of Islam at the University of Chicago, best known
for his monumental and fantastic Venture of Islam
(I, II, III),
which was an attempt to tell the story of "conscience and history in a world
civilization". Both the "world" and the "civilization" part are important:
Hodgson was one of those historians who breaks the world into civilizations,
but didn't think of them as distinct organisms or similar weirdness; rather as
complexes of very broadly-distributed but also very involving traditions.
Moreover, the "world" part mattered a lot too: he constantly kept in view the
fact that civilizations were never isolated from each other, and their
interactions were vital to how they developed, particularly to "Islamicate"
civilization, which for a long time occupied the central position in the
"Afro-Eurasian Oecumene". The whole of it was an effort to see the history of
Islam as part of world history, and to see world history itself objectively.
He also tried very hard to inhabit and convey the moral universe of the
people he wrote about; this was partly about historical understanding and
partly about his own earnest Quaker conscience.
- Hodgson spent many, many years working on a world history, which was left
in an even more fragmentary state than The Venture of Islam at the
time of his death; an unpublishable mess. Rethinking World
History is a compilation of fragments of this manuscript and selections
from The Venture, along with some journal papers and letters. The
product is an excellent epitome of Hodgson's more general and theoretical ideas
about history and historiography: the central role of Islam in world history
and the broad course of Islamicate civilization; the nature of tradition and
the very broad, diffuse complexes of traditions that constitute civilizations,
and the way all traditions constantly change; the errors of traditional
"orientalist" scholarship; the sheer unprecedented weirdness of the modern
"technical" age; the need to crush Eurocentrism if we're to understand history
(and in particular the "optical illusion" which makes us think there's a
"western civilization" going from ancient Greece through Rome to medieval
western Europe and modern European states and their off-shoots); and finally
the fundamental unity of human history, and how that manifested itself over
time.
- There is also an introduction by the editor, one
Edmund Burke
III, which is partly helpful, but also oddly dismissive of Hodgson.
However, this dismissal just takes the form of saying Hodgson is "culturalist"
and doesn't acknowledge Immanuel Wallerstein and the more dodgy sort of
Marxist; Burke doesn't even mention a single material error or omission these
supposed flaws lead Hodgson into. While I appreciate Burke's work in pulling
together the book, I wish he'd thought harder before writing his
introduction.
February 04, 2010
Upcoming Gigs: Bristol
I am giving two talks in Bristol next week about (not so coincidentally) my
two latest papers.
- "The Computational Structure of Spike Trains"
- Bristol Centre for Complexity Sciences, SM2 in the School of Mathematics, 2 pm on Tuesday 9 February
- Abstract: Neurons perform computations, and convey the results of
those computations through the statistical structure of their output spike
trains. Here we present a practical method, grounded in the
information-theoretic analysis of prediction, for inferring a minimal
representation of that structure and for characterizing its
complexity. Starting from spike trains, our approach finds their causal state
models (CSMs), the minimal hidden Markov models or stochastic automata capable
of generating statistically identical time series. We then use these CSMs to
objectively quantify both the generalizable structure and the idiosyncratic
randomness of the spike train. Specifically, we show that the expected
algorithmic information content (the information needed to describe the spike
train exactly) can be split into three parts describing (1) the time-invariant
structure (complexity) of the minimal spike-generating process, which describes
the spike train statistically; (2) the randomness (internal entropy rate) of
the minimal spike-generating process; and (3) a residual pure noise term not
described by the minimal spike-generating process. We use CSMs to approximate
each of these quantities. The CSMs are inferred nonparametrically from the
data, making only mild regularity assumptions, via the causal state splitting
reconstruction algorithm. The methods presented here complement more
traditional spike train analyses by describing not only spiking probability and
spike train entropy, but also the complexity of a spike train's structure. We
demonstrate our approach using both simulated spike trains and experimental
data recorded in rat barrel cortex during vibrissa stimulation.
- Joint work with Rob Haslinger and Kristina Lisa Klinkner.
- "Dynamics of Bayesian updating with dependent data and misspecified models"
- Statistics seminar, Department of Mathematics, Seminar
Room SM3, 2:15 pm on Friday 12 February
- Abstract: Much is now known about the consistency of Bayesian
non-parametrics with independent or Markovian data. Necessary conditions for
consistency include the prior putting enough weight on the right neighborhoods
of the true distribution; various sufficient conditions further restrict the
prior in ways analogous to capacity control in frequentist nonparametrics. The
asymptotics of Bayesian updating with mis-specified models or priors, or
non-Markovian data, are far less well explored. Here I establish sufficient
conditions for posterior convergence when all hypotheses are wrong, and the
data have complex dependencies. The main dynamical assumption is the asymptotic
equipartition (Shannon-McMillan-Breiman) property of information theory. This,
plus some basic measure theory, lets me build a sieve-like structure for the
prior. The main statistical assumption concerns the compatibility of the prior
and the data-generating process, bounding the fluctuations in the
log-likelihood when averaged over the sieve-like sets. In addition to posterior
convergence, I derive a kind of large deviations principle for the posterior
measure, extending in some cases to rates of convergence, and discuss the
advantages of predicting using a combination of models known to be
wrong.
- (More on this paper)
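As a toy illustration of the second abstract (a sketch, not the construction used in the paper), the Python below updates a posterior over a handful of IID Bernoulli hypotheses when the data actually come from a two-state Markov chain, so every hypothesis is wrong and the data are dependent. The function names, the chain's transition probabilities, and the grid of hypotheses are all assumptions made for the example. By the ergodic theorem, the per-observation log-likelihood of Bernoulli(p) converges to pi_1 log p + pi_0 log(1 - p), where pi is the chain's stationary distribution, so the posterior should pile up on the grid point nearest the stationary probability of a 1 (here 0.1/(0.1 + 0.3) = 0.25).

```python
import math
import random


def simulate_two_state_chain(n, p01, p10, rng):
    """Binary Markov chain with P(0 -> 1) = p01 and P(1 -> 0) = p10."""
    x = [0]
    for _ in range(n - 1):
        if x[-1] == 0:
            x.append(1 if rng.random() < p01 else 0)
        else:
            x.append(0 if rng.random() < p10 else 1)
    return x


def posterior_over_bernoulli(data, grid):
    """Posterior over IID Bernoulli(p) hypotheses, uniform prior on the grid."""
    ones = sum(data)
    zeros = len(data) - ones
    # Work in log space so that long data sets do not underflow.
    log_post = [ones * math.log(p) + zeros * math.log(1.0 - p) for p in grid]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)  # fixed seed, purely for reproducibility
data = simulate_two_state_chain(20000, p01=0.1, p10=0.3, rng=rng)
grid = [0.05 + 0.1 * k for k in range(10)]  # hypotheses 0.05, 0.15, ..., 0.95
for p, w in zip(grid, posterior_over_bernoulli(data, grid)):
    print(f"p = {p:.2f}  posterior = {w:.3g}")
```

None of the hypotheses can reproduce the chain's serial dependence, yet the posterior still converges: it concentrates on the wrong model that is least wrong in the relative-entropy sense.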
I'll also be lecturing
about prediction, self-organization
and filtering to the BCCS
students.
I presume that I will not spend the whole week talking about
statistics, or working on the next round of papers and lectures; is there, I
don't know, someplace in Bristol to hear music or something?
Self-centered;
Enigmas of Chance;
Complexity;
Minds, Brains, and Neurons
Posted by crshalizi at February 04, 2010 13:48 | permanent link
January 31, 2010
Books to Read While the Algae Grow in Your Fur, January 2010
- Virginia Swift, Hello, Stranger
- Enjoyable mystery with eccentric academics, God-botherers and
gentrification in present-day Laramie. Nth book in a series; I'll keep an eye out for the
others.
- Intelligence
- Smart crime/spook drama set in one of the most attractive cities in the
world (Vancouver), which could only be improved if it didn't end in the WORST
CLIFFHANGER EVER. (Ahem.) Not, of course, as good as The Wire,
but then nothing is.
- Daniel Waley, The Italian City-Republics
- Short, readable political-institutional history of the communes of northern
and central Italy. He begins with the communes starting to take form in the
towns and wrest control from their bishops, say around 1000, and ends by about
1400, by which point the towns had almost all, except
for Venice,
descended into some form of monarchy, generally under the domination of the
local feudal land/war-lords. (Waley says little about Venice, which in
retrospect seems odd, though it didn't strike me while reading it.) While
Waley is good at describing this historical trajectory, he says little about
why so many Italian cities followed it. I'd think it'd be natural to
compare the Italian case to contemporary cities elsewhere, but I think there is
exactly one sentence on them. (I imagine all kinds of interesting comparative
work could be or has been done.) But within those limits, it's a nice book.
Waley has also written studies on Siena and Orvieto, which sound interesting.
- Terry Pratchett, Nation
- You don't really need me to recommend Terry Pratchett to you, especially
when he's writing about how people find ways to go on when their world has been
pointlessly destroyed.
- Richard
Hofstadter, Anti-Intellectualism in American
Life
- Astonishingly, this still feels like it fits after a lapse of half
a century. The whole
"tax-raising,
latte-drinking, sushi-eating, Volvo-driving, New-York-Times-reading,
body-piercing, Hollywood-loving, left-wing freak-show" nonsense of the last
thirty years now makes a lot more sense; and the chapters about the history of
American education were frankly a revelation to me. (The chapter on Dewey and
his pedagogical influence seems like a model of being respectfully but
unrelentingly critical.) No doubt for real historians, this is all painfully
outdated, and whatever's actually sound has long since been incorporated into
other works, which don't provide such unintentional moments of amusement as
listing, among the unfair accusations heaped on Jefferson, keeping a
slave mistress and having children by her. (For that matter I don't care for
the Beats very much, but they certainly contributed more to our literature than
he thought they would.) Still: the man could write.
- ObLinkage: Steve Laniel on AIiAL.
- D. N. MacKenzie (trans.), Poems from the Divan of Khushâl Khân Khattak
- The first significant body of poetry in
Pashto; Khushal
was a 17th century warlord in what is now the Northwest Frontier, owing his
position to a combination of tribal authority and appointment by the Mughals.
This seems to be the most recent translation of a selection from his poetry in
English, dating from 1965. It is arranged on no particular principles (some
Pashto editions are, following tradition, arranged alphabetically by the first
letter of the poem), which produces a rather odd effect, that I might summarize
as follows: Khushal is happily in love: wow is the beloved a hottie. Khushal
is unhappily in love: separation is awful, especially if it's because the
beloved doesn't want to see Khushal. Khushal is a fierce warrior who is also a
keen hunter; falconry rules. Khushal has a remarkable capacity for drink. (Go
ahead, try and tell me that's allegorical.) Aurangzeb sucks, especially in
comparison to his father.
(Well, he did, and
sticking Khushal in jail can't have won him any points.) The Afghans should
rally to Khushal and defeat Aurangzeb! Men are treacherous, false-faced
bastards, but Afghans are really worse than the rest. (To be fair, having one
of your own sons wage war on you in the name of Aurangzeb has got to be pretty
embittering.) Khushal will withdraw from the sinful world and spend his days
in pious penance. Khushal glorifies God. Repeat.
- My grandfather's extemporized translations were better English poetry, but
I will never hear those again.
- Moez Draief and Laurent Massoulié, Epidemics and Rumors
in Complex Networks
- A nice short (< 120 pp.) account of the connections among stochastic
network models, branching processes, and epidemic models, of the
"susceptible-infectious-susceptible" or "susceptible-infectious-recovered"
type, including epidemics on networks. ("Rumors" are assumed to fall under
such models.)
- They begin with the basic Galton-Watson branching process model, where each
member of a population produces a random number of descendants (possibly zero),
independently of everyone else, and this distribution is constant both within
and across generations. Following over a century of tradition, they look at
whether the population survives forever or goes extinct, how large it gets, how
long it takes to go extinct if it does, etc. This then gets turned into a
simple epidemic model ("member of population" = infected individual). It also
maps on to the Erdős-Rényi network model, with "has an edge with" taking the
place of "is a descendant of": pick your favorite node, and connect it to a
random selection of other nodes, the number following a binomial distribution;
connect each of them in turn to more random nodes. The size of the branching
process's population corresponds to the size of the connected component in the
graph. The mapping really only works in the limit of low-density graphs
(the size of the component is roughly a sum of independent quantities
when there are no loops), but it's enough to study the emergence of a giant
component and the behavior of the diameter of the graph. As a prelude to more
sophisticated models, they then prove a form
of Kurtz's Theorem on the convergence of
Markov chains to ordinary differential equations in the large-population limit.
The second half of the book
rehearses Watts-Strogatz
small-world and
Barabási-Albert
scale-free networks (including mention of Yule but not, oddly,
of Herbert Simon), before
wrapping up with epidemic models on graphs, and the "viral marketing" problem
of deciding where, on a known and fixed network, to start an epidemic for
maximum impact.
- Of course, since it's a mathematics book, the problem of how to link these
models to data isn't even dismissed.
- This isn't a ground-breaking work, but it's nice to have all this in a
single book, and one a bit more accessible than, say, Durrett's
Random
Graph Dynamics (though by the same token less comprehensive). The
implied reader is comfortable with stochastic processes at the level of
something
like Grimmett
and
Stirzaker; measure-theoretic
issues are avoided, even when discussing Kurtz's Theorem. (Their version
is thus much less precise and powerful than his, but vastly easier to
understand.) Anyone comfortable with that level of probability could read it
without much trouble, and I'd happily use it in a class.
- Disclaimer: I read a draft of the manuscript for the publisher
in 2007, and they sent me a free copy of the book, but I have no stake in its
success.
- Joseph L. Graves, Jr., The Emperor's New Clothes: Biological
Theories of Race at the Millennium
- There are places where he lapses into biological jargon, and others where I
think lay readers would have benefited from more detailed rebuttals of the
common counter-arguments, but over-all I recommend this very strongly. (Thanks
to I.B. for lending me her copy.)
- Pascal
Massart, Concentration
Inequalities and Model Selection
- Using empirical
process theory, and more specifically concentration of
measure, to get finite-sample, i.e., non-asymptotic, risk bounds for
various forms of model
selection. The basic strategy is to find conditions under which every
model in a reasonable class will, with high probability, perform about as well
on sample data as they can be expected to do on new data; this involves
constraining the richness or flexibility of the model class. A little extra
work, and the addition of suitable penalties to the fit, gets bounds that
extend over multiple classes of model, even over a countable infinity of
classes. Among other highlights, Massart shows why the famous AIC
heuristic is often definitely sub-optimal, and how to correct it; it
also offers corrections to Vapnik's (much
better) structural risk minimization,
and a nice treatment of data-set splitting (= 1-fold cross-validation). All of
this is for IID data, so the usual caveats apply. Formally self-contained, but
realistically some previous exposure to empirical processes (at the level of
Pollard's notes if not higher) will be
needed. Available for free as
a large PDF
preprint, but I found it much more convenient to read a dead-tree
copy.
- Elizabeth Bear, New Amsterdam
- Alternate-history fantasy mystery stories. Owing something, perhaps, to
Randall Garrett's "Lord Darcy" stories (the name of the heroine is distinctly
suspicious), but without their complacency about the benevolence of the powers
that be.
- David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining
- I've used this three times now in
teaching 36-350, with
about 75 students total over the years. I keep using it because it's the best
textbook on data-mining I know. It
covers the whole process, soup to nuts: data collection (and the importance of
understanding what the data actually mean, if anything), cleaning, databases,
model construction, model evaluation, optimization, visualization, etc. All of
this is organized around four crucial questions: what kind of pattern are we
looking for in the data, and how do we represent those patterns? how do we
score representations against each other? how do we search for good
representations? what do we need to do to implement that search efficiently?
All of the basic methods (and many not so basic ones) are in here, all seen as
different answers to these questions. I find its explanations extremely clear,
and my students seem to as well. I regard it as a strength that it
is not tied to pre-canned software, which would only encourage
dependency and thoughtlessness.
- The only real competition, to my mind,
is Hastie,
Tibshirani and Friedman. But the Stanford book is distinctly more
about statistics, and has more statistical theory and math (though
not, from my point of view, a lot of either), whereas this one is
distinctly focused on data-mining and on computation. It would be nice if Hand
&c. had material on support vector machines, and more on ensemble methods;
perhaps it's time for a second edition?
- Disclaimer: I almost took a post-doc under Smyth rather than
coming to CMU, back in 2004; also, the MIT Press sent me a free review copy of
this book (in 2001).
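The Galton-Watson model from the Draief and Massoulié entry above is easy to simulate. The sketch below is an illustration rather than anything taken from the book: it assumes a Poisson offspring distribution (the sparse-graph limit of the binomial one), and the function names, the population cap used to declare survival, and the parameter values are all choices made for the example. It compares the simulated extinction frequency with the smallest root of q = exp(lam * (q - 1)), the standard fixed-point equation for the extinction probability with Poisson(lam) offspring.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def goes_extinct(lam, rng, cap=100, max_generations=1000):
    """Run one Galton-Watson process from a single ancestor.

    Survival is declared (an approximation) once a generation reaches `cap`.
    """
    population = 1
    for _ in range(max_generations):
        if population == 0:
            return True
        if population >= cap:
            return False
        population = sum(poisson(lam, rng) for _ in range(population))
    return population == 0


def extinction_probability(lam, iterations=1000):
    """Smallest root of q = exp(lam * (q - 1)), by iterating from q = 0."""
    q = 0.0
    for _ in range(iterations):
        q = math.exp(lam * (q - 1.0))
    return q


rng = random.Random(1)  # fixed seed, purely for reproducibility
runs = 2000
lam = 1.5  # mean number of offspring (hypothetical value)
simulated = sum(goes_extinct(lam, rng) for _ in range(runs)) / runs
print("simulated extinction frequency:", simulated)
print("fixed-point extinction probability:", extinction_probability(lam))
```

When the mean number of offspring is above 1, the survival probability 1 - q is also the limiting fraction of nodes in the giant component of the corresponding sparse Erdős-Rényi graph, which is the correspondence the entry describes.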
Books to Read While the
Algae Grow in Your Fur;
Pleasures of Detection, Portraits of Crime;
Enigmas of Chance;
Scientifiction and Fantastica;
Writing for Antiquity;
Afghanistan and Central Asia;
The Natural Science of the Human Species;
Networks;
The Beloved Republic;
The Commonwealth of Letters;
Learned Folly
Posted by crshalizi at January 31, 2010 23:59 | permanent link
January 19, 2010
The Work of Art in the Age of Mechanical Reproduction
Attention conservation notice: 800+ words of inconclusive
art/technological/economic-historical musings.
This
thread over at Unfogged reminds me of something that's puzzled me for
years, ever since
reading this:
why didn't prints displace paintings the same way that printed books displaced
manuscript codices? Why didn't it become expected that visual artists, like
writers, would primarily produce works for reproduction? (No doubt, in that
branch of the wave-function*, obsessive fans still want to get the original
drawings, but obsessive fans also collect writers' manuscripts, or
even their typewriters, as
well as their mass-produced books.) 16th century engraving technology was
strong enough that it could implement powerful works of art
(vide),
so that can't be it. And by the 18th century at least writers could make a
living (however precarious) from writing for the mass public, so why didn't
visual artists (for the most part) do likewise? (Again,
it's manifestly
not as though technology has regressed.) Why is it still the case
that a real, high-class visual artist is someone who makes one-offs? I know
that reproductions have been important since at least the late 1800s, but for
works and artists who first made their reputation with unique, hand-made
objects, which is as though the only books which got sent to the printing press
were ones which had already circulated to acclaim in manuscript.
Some possibilities I don't buy:
- Aesthetic limitations. There are valuable effects which can be
achieved with a big original painting which prints just can't match.
Response: there are effects you can achieve with an illuminated,
calligraphic manuscript which you can't match with movable type, either. Those
weren't valuable enough to keep printed books from taking over. Why the
difference? Why not a focus on what can be done through prints, which
is quite a lot? (Witness the experience of the 20th century and later,
when most art lovers know most works of art they enjoy through reproductions.)
- Color. A real limitation; even today, getting color done well in
mass visual media is not entirely trivial
(cf.),
and early modern Europe certainly couldn't do it at all. Response:
What makes color so important? We know that some great art was made without
its benefit, and we don't really know how much better it could have gotten had
prints been the medium of choice. Even if color was all that, it just
pushes the shift to the late 19th century.
- Artists too expensive. Whether you are producing one painting or
a thousand prints, there is a considerable fixed cost to the artist's time and
training. (The first print is very expensive.) Individual patrons
could afford this; the mass public could not. Response: The same
argument would apply to books. Besides, high fixed costs usually
drive towards seeking a wider market, so that the fixed costs are
distributed over a larger number of people. The argument would have to be one
of failure of demand — that where there was one man willing to pay 100
guilders (or whatever) for a painting, there were not, say, 120 people willing
to pay 1 guilder for prints. Why not?
- Paintings too cheap. There have always been too many people
wanting to be visual artists for them to all make a living as original artists.
One of the things they could do instead was paint copies. Response:
The economy of scale problem still applies.
- States too weak. In a competitive market, market prices equal
marginal costs. The marginal cost of producing another copy of a print is
very, very low, so low that the fixed costs of drawing and designing it in the
first place aren't recouped. As usual, then, competitive markets fail
massively at producing informational goods. The modern solution is to
institute and vigorously enforce intellectual property rights. These are
monopoly privileges which the state grants to certain individuals; if anyone
tries to compete with these favorites of the powers that be, then "goons with
guns" (as my libertarian friends like to say) come to stop them. Doing this
requires a really massively powerful and intrusive state, which is a relatively
recent phenomenon, and not to be lightly deployed on behalf
of artists, of all people. Artists who tried to go the
mass-production route would've been even more starvation-prone than those who
didn't attempt it. Response: An exactly parallel argument would
explain why writers didn't embrace printing.
- The revolution has happened. The overwhelming majority of visual
artists do aim their work at reproduction; it's just a small minority
which continues to produce one-offs. This minority has, however, a lot more
cultural prestige. Response: There's some merit to this, but it's
bizarre and anomalous; it's not as though our really high-class
literature was still illuminated or calligraphic manuscripts, and printing was
reserved for declassé "commercial" work.
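The fixed-cost arithmetic behind the "Artists too expensive" and "States too weak" items can be made concrete. The sketch below uses hypothetical figures in the spirit of the guilder example; the function name and the specific costs are assumptions for illustration only. It computes how many buyers are needed before sales cover the fixed cost of producing the design, and shows that when competition drives the price down to marginal cost, no number of buyers is enough.

```python
import math


def buyers_needed(fixed_cost, price, marginal_cost):
    """Smallest number of buyers whose payments cover the fixed cost.

    Returns None when the price does not exceed the marginal cost,
    since then no volume of sales ever recoups the fixed cost.
    """
    margin = price - marginal_cost
    if margin <= 0:
        return None
    return math.ceil(fixed_cost / margin)


# Hypothetical figures: a design costing 100 guilders of the artist's time.
print(buyers_needed(100, 1, 0))  # prints at 1 guilder, negligible copying cost
print(buyers_needed(100, 5, 1))  # dearer prints with a 1-guilder copying cost
print(buyers_needed(100, 1, 1))  # competitive pricing: price equals marginal cost
```

On these numbers, the 120 one-guilder buyers of the example would more than cover a 100-guilder fixed cost, so the argument does have to rest on a failure of demand or of pricing power rather than on the fixed cost alone.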
The most convincing argument I've been able to come up with has to do with how
visual artworks were and are used. Even in manuscript, books were for
reading: private consumption, or near enough. European culture, however,
provided a steady stream of demand for works of visual art for public
display, which is rather different. If it were just a matter of pictures
you'd like to look at for your own enjoyment, perhaps prints would serve. But
if it's about decorating the church/guildhall/imposing estate, then you need
a unique painting of St. Jerome/the burgomasters/the master of the
house. The main point is that the owner has the resources to command their
very own artwork, not the work's intrinsic aesthetic properties (which
good reproductions would share). But even then, why not develop a second
stream of reproducible artwork for private rather than conspicuous consumption?
And indeed why not try to achieve similar effects
in print,
thereby broadcasting the message?
Updates, 31 January 2010: In
correspondence, Elihu
Gerson points to
an interesting-looking
book relevant to the social-use explanation.
Also, it seems I should clarify that I am not asking why (as
Vukutu
puts it) "people desire original works of visual art rather than printed
reproductions". If you are going to paint in oils on canvas, then of course
making a flat print of the result is going to lose some detail of the physical
object, and those details might contribute in important ways to people's
experience of the object; there might be a real esthetic loss to looking at a
reproduction of a painting. What I am asking is why then we do not produce
artworks which are designed for reproduction. Or rather, we do
produce lots of such art, but it's not seen as very valuable, and generally not
even real art in the honorific sense. "Printed reproductions of physical
paintings lose valuable details" does not answer "Why did our visual arts
continue to focus on making one-off works?", unless perhaps you add some
extra premises, like (i) no print-reproducible image could be as esthetically
valuable as a three-dimensional painting, and (ii) that difference in intrinsic
quality was extremely important to the people who consumed art, and I am very
dubious about both of these.
Finally, I don't think it's sufficient to point to "tradition", since
traditions change all the time. That deserves another argument, but another
time. In lieu of which, I'll just offer a quotation from a favorite book,
Joseph (Abu Thomas)
Levenson's Confucian
China and Its Modern Fate; he is writing about ideas, but as he
makes clear, what he says applies just as much to aesthetic or practical
choices as to intellectual ones.
With the passing of time, ideas change. This statement is ambiguous, and less
banal than it seems. It refers to thinkers in a given society, and it refers
to thought. With the former shade of meaning, it seems almost a truism: men
may change their minds or, at the very least, make a change from the mind of
their fathers. Ideas at last lose currency, and new ideas achieve it. If we
see an iconoclastic Chinese rejection, in the nineteenth and twentieth
centuries, of traditional Chinese beliefs, we say that we see ideas changing.
But an idea changes not only when some thinkers believe it to be outworn but
when other thinkers continue to hold it. An idea changes in its persistence as
well as in its rejection, changes "in itself" and not merely in its appeal to
the mind. While iconoclasts relegate traditional ideas to the past,
traditionalists, at the same time, transform traditional ideas in the present.
This apparently paradoxical transformation-with-preservation of a
traditional idea arises from a change in its world, a change in the thinker's
alternatives. For (in a Taoist manner of speaking) a thought includes what its
thinker eliminates; an idea has its particular quality from the fact that other
ideas, expressed in other quarters, are demonstrably alternatives. An idea is
always grasped in relative association, never in absolute isolation, and no
idea, in history, keeps a changeless self-identity. An audience which
appreciates that Mozart is not Wagner will never hear the
eighteenth-century Don Giovanni. The mind of a nostalgic European
medievalist, though it may follow its model in the most intimate, accurate
detail, is scarcely the mirror of a medieval mind; there is sophisticated
protest where simple affirmation is meant to be. And a harried Chinese
Confucianist among modern Chinese iconoclasts, however scrupulously he respects
the past and conforms to the letter of tradition, has left his complacent
Confucian ancestors hopelessly far behind him...
An idea, then, is a denial of alternatives and an answer to a question.
What a man really means cannot be gathered solely from what he asserts; what he
asks and what other men assert invest his ideas with meaning. In no idea does
meaning simply inhere, governed only by its degree of correspondence with some
unchanging objective reality, without regard to the problems of its
thinker. [pp. xxvii--xxviii; for context, this passage was first published in
1958]
*: With apologies to the blogger formerly known as "the blogger formerly known as 'The Statistical Mechanic' ".
Manual
trackback: Mostly
Hoofless; 3
Quarks
Daily; Cliopatria (!);
Vukutu.
Writing for Antiquity
Posted by crshalizi at January 19, 2010 22:01 | permanent link
December 31, 2009
Books to Read While the Algae Grow in Your Fur, December 2009
- Duplicity
- It'd be a spoiler simply to count the number of layers of trickery
here; and it's romantic; what more could you want? (Recommended
by Kate
Nepveu.)
- Burrowers
- Creepy, grim little western horror movie. The ecology almost
makes sense even. (No purchase link just because Powell's doesn't seem
to sell it.)
- Nick Abadzis, Laika
- The story
of Korolev, the
Soviet space program, and of course the eponymous heroine,
the first terrestrial creature in
space. I kept
muttering "The
dog dies at the end", but by the end it mattered to me that the dog
died.
- Seamus
Cooper, The
Mall of Cthulhu
- Yes, it's about an ancient alien squid-god trying to destroy Life As We
Know It via a shopping mall in suburban New England, with all the usual
indescribable horrors, and lots of joking references to previous works in the
genre ("Ms. Harker"!). But also: a convincingly unidealized yet affecting
friendship. — Apparently there will be a sequel; I will read
it very eagerly.
- John Layman and Rob
Guillory, Chew:
Taster's Choice
- Unquestionably the finest, and grossest, detective story about food and
black-market poultry ever, at least among those executed as comic books.
- Susan Hough, Predicting the Unpredictable: The Tumultuous Science
of Earthquake Prediction
- Full (positive) review coming later for a magazine.
- Philip
Kitcher, In
Mendel's Mirror: Philosophical Reflections on Biology
- Collection of Kitcher's papers about the philosophy of biology and related
issues, mostly tied (as the subtitle suggests) to genetics. The most
interesting paper for me was "Developmental Decomposition and the Future of
Human Behavioral Ecology"
[JSTOR], about what'd be involved
in doing something like evolutionary
psychology properly. (I should warn readers of that chapter
— Kitcher doesn't do so properly — that he takes as his case
study the explanation of incest avoidance, which leads him into a detailed
examination of the situations where incest is not avoided. There are
sound reasons for this, but it's not for the squeamish or, I'd imagine, the
traumatized.) Those who like this sort of thing will find it to be just the sort of thing they like.
- Sarah Graves, The Book of Old Houses
- Astonishingly, I have yet to experience series fatigue after eleven books.
The little bits of Lovecraftian atmosphere added here are, thankfully, debunked
inside the story. (Previously: 1--4,
5, 6--10)
- Rosemary Kirstein, The Language of Power
- The continuation of what is
at once an epic fantasy full of marvels and an inspiring depiction of the life
of the mind. The scene with Rowan, Will, and the pair of invisible dragons
will linger in my memory, and I found myself pleased and astonished at
Kirstein's depiction of the sheer beyond-all-experience strangeness of
magic. (There, I think I have avoided spoilers for once.)
- My only complaint: where is the rest of the series?!?? I want
them now!!!!
- A. R. Luria, Cognitive Development: Its Social and Cultural
Foundations
- Re-read after a lapse of ten years. I still think it's a fascinating and
profound, though also flawed, work; the successes now loom larger for me than
the flaws, though the latter are very real. (To recapitulate what I've
written elsewhere: Uzbekistani peasants in the 1930s
had excellent reasons to play dumb when Soviet officials came around
asking bizarre and leading questions, especially about foreign countries, or
premised on obvious falsehoods.) Two things which now impress me more: first,
the stuff about visual illusions and colors; and, second, the demonstration
that the subjects could solve more concrete problems which were formally
identical to the ones they couldn't, or wouldn't, solve in abstract or
contrary-to-fact form.
- Hans Reichenbach, The Direction of Time
- One of the greatest of
the logical positivists
takes a whirl at reconciling time-reversible microscopic physics with
irreversible macroscopic processes in 1956. I began reading this a long time
ago, then bogged down in the last chapter, on quantum statistical mechanics; I
took the occasion of a long plane flight to re-read and finish it, and am very
glad I did. The discussion of relativity, thermodynamics and ergodic theory is
clear and sound, if not — at least now — ground-breaking. (It
seems extremely odd that general relativity is so ignored; but perhaps just as
well, since cosmology was about to be revolutionized.) One highlight for me
was the idea of "branch systems", and using the consistency of arrows of time
across nearly-isolated mixing processes (not called that) to construct a more
global arrow of time. Even the chapter on quantum effects was more interesting
than I thought it would be, being mostly concerned with the identity (or lack
thereof) of quantum particles through time, though I think the treatment
in Teller is superior.
- The most fascinating part of the book for me, however, is Reichenbach's
efforts to build up a notion of time which has not just an order but a
direction from causal relations. (If we pick an axis in space, as he
says, it has two equally good orders, say left-to-right or right-to-left; time is
not just ordered but directed, past-to-future.) He develops in considerable
detail the theme that edges in the causal network of spatio-temporal events can
be oriented based on the principle that dependent events become independent
conditional on their common causes. This is incredibly close to
modern ideas about inferring the structure of causal graphical models (see
Spirtes, Glymour and Scheines below; Glymour studied under
two of Reichenbach's students, Wesley Salmon and Cynthia Schuster). Sadly, I
would almost say tragically, however, Reichenbach makes the crucial mistake of
thinking that the same sort of independence can easily happen conditional on
common effects, when actually it almost never does. (My marginal note
at this point is, I see, "NOOOO!") Arguably, this delayed the development of
causal inference for decades.
- Reichenbach was drawing on many different areas of physics and mathematics
which have all made a lot of progress in the last half-century, so I am a bit
uneasy about recommending it unreservedly to non-specialists. (There
is a new
book I can recommend, unreservedly and
even reversibly,
to general readers.) But the core ideas are very much right, and it's still an
imposing and inspiring piece of work.
- ObLinkage 1: Emerson
on Reichenbach on time.
- ObLinkage 2: Speaking of Bérubé (as I was, parenthetically),
Steven
Gimbel's "If
I Had A Hammer: Why Logical Positivism Better Accounts for the Need for Gender
and Cultural Studies" tries to appropriate Reichenbach, and logical
positivism more generally, for the forces of political correctness.
- L. G. Godfrey, Misspecification Tests in Econometrics: The Lagrange
Multiplier Principle and Other Approaches
- Lots and lots about checking for whether you have the wrong terms in your
parametric (especially linear-and-Gaussian) model. Less fundamental than the
approaches of White
or Hart, and also better
adapted to the background and habits typical of econometricians. (This is no
accident.)
- (Oh, the Lagrange multiplier principle? Suppose your model imposes some
restrictions on the parameters, as compared to some larger model you can embed
it in. Imagine estimating your model by doing a constrained maximization of
the likelihood in the larger model; how big does the Lagrange multiplier on
your constraints have to be? How much are you paying in likelihood, in other
words, to enforce the constraint? If your model is true, then for large
samples the cost is very small and the Lagrange multiplier tends towards
zero.)
- Warren Ellis and Paul Duffield, Freakangels, vol. 3
- In which we consider various forms of rebuilding.
- Peter Spirtes, Clark Glymour and Richard Scheines, Causation, Prediction and Search
- Re-read as part of preparing for
my lecture
on causal discovery. I spent much of the winter of 2000 working my way
through the first edition, and wound up completely imprinted on its way of
thinking about what causal relationships are, how we should reason about them,
and how we can find them from empirical evidence. On causation and prediction
it now has an equal in Pearl's book (and I admit the latter looks
prettier), but on search, that is, on discovering causal structure, there is
still no rival. Their key observation is that even though correlation does not
imply causation, correlations must have causal explanations. (This idea goes
back to Herbert
Simon, and Hans Reichenbach [see above] at least.) So patterns of
correlations, among more than just two variables, constrain what causal
structures are possible. Sometimes they constrain the causal structure
uniquely, in other cases it's
only partially identified by
the dependencies. And of course there is always the possibility of making a
mistake with limited data. But none of this is any different for causal
discovery than it is for any other form of statistical inference. The great
contribution of this book is showing that causal discovery can be just
another learning problem. They have transformed metaphysical misery into
ordinary statistical unhappiness.
- (I can't resist illustrating, though it's necessarily a bit involved.
Take three variables, call them X, Y and Z. We find
that there is a correlation between X and Y which we can't make
go away, no matter what we control for, and likewise between Y and
Z, but not between X and Z. There are four possibilities
compatible with this: the causal chain X->Y->Z; the
opposite causal chain from Z to X; a "fork" where Y is the
common cause, X<-Y->Z; and a "collider" or
"conjunction" where Y is the common
effect, X->Y<-Z. In the first three cases, Y
"screens off" X from Z — those variables are independent of
each other, conditional on Y. So the absence of conditional
independence definitely tells us which way the causal links point. In fact,
conditional independence at a collider, while mathematically possible, requires
no-margin-for-error adjustment of the parameters, so if we assume that such
conspiracies are absent ["faithfulness"], we have conditional dependence if and
only if there's a collider, which gives us the direction of causation from
correlations. "Orienting" some correlations in this manner induces
orientations in others, distinguishing forks and the two kinds of chain. For
more, see the aforementioned lecture notes, or indeed this book.)
- Disclaimers: All three authors have appointments in the CMU
Machine Learning department which I'm also affiliated with, etc., etc. And the
MIT Press sent me a free copy for review in 2001. (There is a reason my totem
is a sloth, yes?)
- Nunzio DeFilippis, Christina Weir, Brian Hurtt and Arthur Dela Cruz, Skinwalker
- Starts off as a procedural psycho-killer-in-Indian-country mystery, and then gets... strange.
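(To make the three-variable illustration in the Spirtes, Glymour and Scheines entry above concrete, here is a minimal simulation sketch. The linear-Gaussian mechanisms, the unit coefficients, the sample size, and the function names corr, partial_corr and simulate are all assumptions made for illustration, not anything taken from the book. Partial correlation stands in for a conditional-independence test, which is only reasonable here because everything is linear and Gaussian.)

```python
import math
import random


def corr(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(a, b, c):
    """Correlation of a and b after linearly controlling for c."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def simulate(structure, n=20000, seed=1):
    """Draw n samples of (X, Y, Z) from a toy linear-Gaussian model (hypothetical)."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        if structure == "chain":        # X -> Y -> Z
            x = rng.gauss(0, 1)
            y = x + rng.gauss(0, 1)
            z = y + rng.gauss(0, 1)
        elif structure == "fork":       # X <- Y -> Z
            y = rng.gauss(0, 1)
            x = y + rng.gauss(0, 1)
            z = y + rng.gauss(0, 1)
        elif structure == "collider":   # X -> Y <- Z
            x = rng.gauss(0, 1)
            z = rng.gauss(0, 1)
            y = x + z + rng.gauss(0, 1)
        else:
            raise ValueError(f"unknown structure: {structure}")
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


# Print marginal corr(X, Z) and partial corr(X, Z | Y) for each structure.
for name in ("chain", "fork", "collider"):
    xs, ys, zs = simulate(name)
    print(name, round(corr(xs, zs), 2), round(partial_corr(xs, zs, ys), 2))
```

In the chain and the fork, the partial correlation of X and Z given Y should come out near zero (Y screens off X from Z). At the collider, X and Z are marginally uncorrelated but become dependent once Y is controlled for, which is the asymmetry that lets one orient the edges.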
Books to Read While the
Algae Grow in Your Fur;
Scientifiction and Fantastica;
Pleasures of Detection, Portraits of Crime;
Enigmas of Chance;
Philosophy;
Physics;
The Great Transformation;
Minds, Brains, and Neurons;
Afghanistan and Central Asia;
The Progressive Forces;
Biology;
The Eternal Silence of These Infinite Spaces;
Cthulhiana;
The Dismal Science
Posted by crshalizi at December 31, 2009 23:59 | permanent link
Output Summary
After long, long journeys, in one case going back to 2003, some papers have
come out. Alphabetically by distinguished co-authors:
- Aaron Clauset, CRS,
and M. E. J. Newman,
"Power-law distributions in empirical
data", arxiv:0706.1062 =
SIAM Review 51 (2009): 661--703
- I wrote about this when we
first submitted it. In the intervening two and a half years, many people
have continued to make the baby Gauss cry by publishing, and publicizing,
supposed
power laws based on completely inadequate and unreliable methods. Because
their methods are unsound, one has no idea whether they're right or
not, short of re-analyzing the data properly. I sometimes imagine these
authors singing
I could be right
I could be wrong
I feel nice when I sing this song
but many of them at least pretend to care about the
truth of their claims, so
I piously hope that in
the fullness of time the community of inquirers will come around to using
reliable methods. In which regard I am gratified, but also astonished,
to see that this is already the
most-cited
paper I've contributed to, by such a large margin that it's unlikely
anything else I do will ever rival it.
- See also: Aaron.
- Rob Haslinger, Kristina Klinkner and CRS, "The Computational Structure of
Spike Trains", arxiv:1001.0036 = Neural
Computation 22 (2010): 121--157
- I haven't written about this one before, though I feel free to do so now
that we're published. This was a fun venture into applying state-reconstruction
ideas, specifically CSSR, to neural spike
trains, specifically
the barrel cortex of
the rat, which represents sensory input from the whiskers. (The
experimentalists build special whisker-vibrating machines, which are actually
quite impressive.) We do, I think, a pretty good job of predicting the spike
trains in an entirely non-parametric way, and showing how their complexity is
modulated by sensory stimuli — how much tweaking the whisker drives the
cortical neuron.
- CRS, "Dynamics of Bayesian Updating with Dependent Data and Misspecified
Models", arxiv:0901.1342 =
Electronic Journal of
Statistics 3 (2009): 1039--1074
- I also wrote about this
when I first submitted it. I'm particularly grateful to one of the reviewers,
who read the paper very carefully, totally got it, and provided many helpful
suggestions, one of which grew into a new theorem on rates of convergence.
Thank you, benevolent and thoughtful anonymous referee person! Also, the
publication process at EJS was extremely fast and utterly painless.
Other output: my first hemi-demi-semi-co-supervised student graduating with
his doctorate (a fine piece of work I wish I could link to); a paper draft
finished and sitting on a collaborator's desk (no pressure!);
the homophily paper is almost finished (I need to speed
up some simulations and cut out most of the jokes); half-a-dozen referee
reports of my own (a deliberate new low; made easier by boycotting Elsevier);
five papers edited for Annals of
Applied Statistics (a new high); nine lectures newly written or
massively revised for 36-350; all the problem sets for
350 re-worked and much better; three books reviewed
for American Scientist (and a whole bunch
of mini-reviews for nowhere in particular).
On the other hand, no chapters finished for Statistical Analysis of
Complex Systems; three very patient collaborators in different parts of
Manhattan waiting for me to turn things around; one superhumanly patient
collaborator in Santa Fe ditto; and one project which has been accreting since
2007 really needs to be cut and polished into some papers. Resolution for next
year: more papers.
Self-Centered;
Enigmas of Chance;
Complexity;
Power Laws;
Minds, Brains, and Neurons
Posted by crshalizi at December 31, 2009 18:45 | permanent link
December 28, 2009
Significance, Power, and the Will to Believe
Attention conservation notice: 2100 words on parallels
between statistical hypothesis testing and Jamesian pragmatism; an idea I've
been toying with for
a decade
without producing anything decisive or practical. Contains algebraic symbols
and long quotations from ancient academic papers. Also some history-of-ideas
speculation by someone who is not a historian.
When last we saw the Neyman-Pearson lemma, we were
looking at how to tell whether a data set x was signal or noise,
assuming that we know the statistical distributions of noise (call it p)
and the distribution of signals (q). There are two kinds of mistake we
can make here: a false alarm, saying "signal" when x is really noise,
and a miss, saying "noise" when x is really signal. What Neyman and
Pearson showed is that if we fix on a false alarm rate we can live with (a
probability of mistaking noise for signal; the "significance level"), there is
a unique optimal test which minimizes the probability of misses --- which
maximizes the power to detect signal when it is present. This
is the likelihood ratio test, where we say "signal" if and only
if q(x)/p(x) exceeds a certain threshold picked to
control the false alarm rate.
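A minimal sketch of the likelihood ratio test, under assumptions chosen purely for illustration: a single real-valued observation x, noise p taken to be a standard Gaussian, and signal q a unit-variance Gaussian with mean 1. The names likelihood_ratio, lr_test and say_signal are hypothetical. Because q(x)/p(x) is increasing in x in this toy case, the threshold on the ratio can be found from the quantile of p that gives the desired false alarm rate.

```python
from statistics import NormalDist

# Assumed toy distributions: p = noise, q = signal.
noise = NormalDist(mu=0.0, sigma=1.0)
signal = NormalDist(mu=1.0, sigma=1.0)


def likelihood_ratio(x):
    """q(x) / p(x) for the two assumed Gaussians."""
    return signal.pdf(x) / noise.pdf(x)


def lr_test(alpha):
    """Return (ratio threshold, power) for false alarm rate alpha.

    The ratio is monotone in x here, so the cut-off in x is the
    (1 - alpha) quantile of the noise distribution.
    """
    x_cut = noise.inv_cdf(1.0 - alpha)
    threshold = likelihood_ratio(x_cut)
    power = 1.0 - signal.cdf(x_cut)   # Pr(say "signal" | signal)
    return threshold, power


def say_signal(x, threshold):
    """Say "signal" if and only if the likelihood ratio exceeds the threshold."""
    return likelihood_ratio(x) > threshold


threshold, power = lr_test(0.05)
print("threshold on q(x)/p(x):", threshold)
print("power at 5% false alarm rate:", power)
```

Loosening the false alarm rate (a larger alpha) lowers the threshold and raises the power, which is the trade-off the lemma is about.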
The Neyman-Pearson lemma comes from
their 1933 paper; but the
distinction between the two kinds of errors is clearly more fundamental.
Where does it come from?
The first place Neyman and/or Pearson use it, that I can see, is their 1928
paper
(in two parts),
where it's introduced early and without any fanfare. I'll quote it, but with
some violence to their notation, and omitting footnoted asides (from p. 177 of
part I; "Hypothesis A" is what I'm calling "noise"):
Setting aside the possibility that the sampling has not been random or that the
population has changed during its course, x must either have been drawn
randomly from p or from q, where the latter is some other
population which may have any one of an infinite variety of forms differing
only slightly or very greatly from p. The nature of the problem is such
that it is impossible to find criteria which will distinguish exactly between
these alternatives, and whatever method we adopt two sources of error must
arise:
- Sometimes, when Hypothesis A is rejected, x will in fact have been drawn from p.
- More often, in accepting Hypothesis A, x will really have been drawn from q.
In the long run of statistical experience the frequency of the first source
of error (or in a single instance its probability) can be controlled by
choosing as a discriminating contour, one outside which the frequency of
occurrence of samples from p is very small — say, 5 in 100 or 5 in
1000. In the density space such a contour will include almost the whole weight
of the field. Clearly there will be an infinite variety of systems from which
it is possible to choose a contour satisfying such a condition....
The second source of error is more difficult to control, but if wrong
judgments cannot be avoided, their seriousness will at any rate be diminished
if on the whole Hypothesis A is wrongly accepted only in cases where the true
sampled population, q, differs but slightly from p.
The 1928 paper goes on to say that, intuitively, it stands to reason that the
likelihood ratio is the right way to accomplish this. The point of the 1933
paper is to more rigorously justify the use of the likelihood ratio (hence the
famous "lemma", which is really not set off as a separate lemma...). Before
unleashing the calculus of variations, however, they warm up with some more
justification (pp. 295--296 of their 1933):
Let us now for a moment consider the form in which judgments are made in
practical experience. We may accept or we may reject a hypothesis with varying
degrees of confidence; or we may decide to remain in doubt. But whatever
conclusion is reached the following position must be recognized. If we reject
H0, we may reject it when it is true; if we accept H0, we
may be accepting it when it is false, that is to say, when really some
alternative Ht is true. These two sources of error can
rarely be eliminated completely; in some cases it will be more important to
avoid the first, in others the second. We are reminded of the old problem
considered by LAPLACE of the number of votes in a court of judges that should
be needed to convict a prisoner. Is it more serious to convict an innocent man
or to acquit a guilty? That will depend upon the consequences of the error; is
the punishment death or fine; what is the danger to the community of released
criminals; what are the current ethical views on punishment? From the point of
view of mathematical theory all that we can do is to show how the risk of the
errors may be controlled and minimised. The use of these statistical tools in
any given case, in determining just how the balance should be struck, must be
left to the investigator.
(Neither Laplace nor LAPLACE is mentioned in their 1928 paper.)
Let's step back a little bit to consider the broader picture here. We have
a question about what the world is like --- which of several conceivable
hypotheses is true. Some hypotheses are ruled out on a priori
grounds, others because they are incompatible with evidence, but that still
leaves more than one admissible hypothesis, and the evidence we have does
not conclusively favor any of them. Nonetheless, we must choose one
hypothesis for purposes of action; at the very least we will act as
though one of them is true. But we may err just as much through rejecting
a truth as through accepting a falsehood. The two errors are symmetric, but
they are not the same error. In this situation, we are advised to pick a
hypothesis based, in part, on which error has graver consequences.
This is precisely the set-up of William James's "The Will to
Believe". (It's
easily accessible
online, as are summaries and interpretations; for instance,
an application
to current controversies by Jessa
Crispin.) In particular, James lays great stress on the fact that what
statisticians now call Type I and Type II errors are both errors:
There are two ways of looking at our duty in the matter of opinion,
— ways entirely different, and yet ways about whose difference the theory
of knowledge seems hitherto to have shown very little concern. We must know
the truth; and we must avoid error, — these are our first and great
commandments as would-be knowers; but they are not two ways of stating an
identical commandment, they are two separable laws. Although it may indeed
happen that when we believe the truth A, we escape as an incidental consequence
from believing the falsehood B, it hardly ever happens that by merely
disbelieving B we necessarily believe A. We may in escaping B fall into
believing other falsehoods, C or D, just as bad as B; or we may escape B by not
believing anything at all, not even A.
Believe truth! Shun error! — these, we see, are two materially
different laws; and by choosing between them we may end by coloring differently
our whole intellectual life. We may regard the chase for truth as paramount,
and the avoidance of error as secondary; or we may, on the other hand, treat
the avoidance of error as more imperative, and let truth take its chance.
Clifford
... exhorts us to the latter course. Believe nothing, he tells us, keep your
mind in suspense forever, rather than by closing it on insufficient evidence
incur the awful risk of believing lies. You, on the other hand, may think that
the risk of being in error is a very small matter when compared with the
blessings of real knowledge, and be ready to be duped many times in your
investigation rather than postpone indefinitely the chance of guessing true. I
myself find it impossible to go with Clifford. We must remember that these
feelings of our duty about either truth or error are in any case only
expressions of our passional life. Biologically considered, our minds are as
ready to grind out falsehood as veracity, and he who says, "Better go without
belief forever than believe a lie!" merely shows his own preponderant private
horror of becoming a dupe. He may be critical of many of his desires and
fears, but this fear he slavishly obeys. He cannot imagine any one questioning
its binding force. For my own part, I have also a horror of being duped; but I
can believe that worse things than being duped may happen to a man in this
world: so Clifford's exhortation has to my ears a thoroughly fantastic sound.
It is like a general informing his soldiers that it is better to keep out of
battle forever than to risk a single wound. Not so are victories either over
enemies or over nature gained. Our errors are surely not such awfully solemn
things. In a world where we are so certain to incur them in spite of all our
caution, a certain lightness of heart seems healthier than this excessive
nervousness on their behalf. At any rate, it seems the fittest thing for the
empiricist philosopher.
From here the path to James's will to believe is pretty clear, at least in
the form he advocated it, which is that of picking among hypotheses which are
all "live"*, and where some choice must be made among them. What I am
interested in, however, is not the use James made of this distinction, but
simply the fact that he made it.
So far as I have been able to learn, no one drew this distinction between
seeking truth and avoiding error before James, or if they did, they didn't make
anything of it. (Even for Pascal in his wager, the idea that believing in
Catholicism if it is false might be bad doesn't register.) Yet this
is just what Neyman and Pearson were getting at, thirty-odd years later. There
is no mention of James in these papers, or indeed of any other source. They
present the distinction as though it were obvious, though eight decades of
subsequent teaching experience shows it is anything but. Neyman and Pearson
were very interested in the foundations of statistics, but seem to have paid no
attention to earlier philosophers, except for the arguable case of Pearson's
father Karl and
his Grammar of
Science (which does not seem to mention James). Yet there it is.
It really looks like two independent inventions of the whole scheme for judging
hypotheses.
My prejudices being what they are, I am much less inclined to think that
James illuminates Neyman and Pearson than the other way around. James was, so
to speak, arguing that we should trade significance — the risk of
mistaking noise for signal — for power, finding some meaningful signal in
what he elsewhere called the "blind molecular chaos" of the physical universe.
Granting that there is a trade-off here, however, one has to wonder
about how stark it really is (cf.), and whether his
will-to-believe is really the best way to handle it. Neyman and Pearson
suggest we should look for a procedure for resolving metaphysical questions
which maximizes the ability to detect larger meanings for a given risk of
seeing faces in clouds — and would let James and Clifford set their
tolerance for that risk to their own satisfaction. Of course, any such
procedure would have to squarely confront the fact that there may be no way of
maximizing power against multiple alternatives simultaneously...
The extension to confidence sets, consisting of all hypotheses not rejected
by suitably powerful tests
(per Neyman 1937) is left as an
exercise to the reader.
*: As an example of a "dead" hypothesis, James gives
believing in "the Mahdi",
presumably Muhammad Ahmad
ibn as-Sayyid Abd Allah. I'm not a Muslim, and those of my ancestors who
were certainly
weren't Mahdists, but this was
still a "What do you mean 'we', white man?" moment in my first reading of the
essay. To be fair, James gives me many fewer such moments than most of his
contemporaries.
Manual trackback: Brad DeLong; Robo;
paperpools (I am not worthy!)
Enigmas of Chance;
Philosophy;
Modest Proposals
Posted by crshalizi at December 28, 2009 00:08 | permanent link
December 08, 2009
Uniform Probability on Infinite Spaces Considered Harmful
Attention conservation notice: 1000 words on a short
probability problem. Several hobby-horses get to take turns around the
yard.
Wolfgang at Mostly
Harmless poses
a problem (I've lightly tweaked the notation):
Consider a random process X(t) which generates a series of 0s and 1s, but many more 0s because the probability for X(t) = 1 decreases with t as 2^(-t).
Now assume that we encounter this process not knowing 'how far we are
already', in other words we don't know the value of t. The question is:
"What is the probability to get a 1?"
Unfortunately there are two ways to answer this question. The first calculates the 'expectation value', as a physicist would call it, or 'the mean' as a statistician would put it, which is zero.
In other words, we sum over all possible t with equal weight and have to consider s = sum( 2^(-t) ) with t = 1, 2, ... N; It is not difficult to see that s = 1/2 + 1/4 + ... equals 1.
The answer is therefore Pr(X=1) = s/N = 1/N and because N is infinite (the process never ends) we get Pr(X=1) = 0.
The second answer simply looks at the definition of the process and points out that Pr(X=1) = 2^(-T), where T is the current value of t. Although we don't know T it must have some finite value and it is obvious that Pr(X=1) > 0.
So which one is it, Pr(X=1) = 0 or Pr(X=1) > 0?
Fatwa: The second answer is correct, and the first is wrong.
Discussion: This is a cute example of the weirdness which results
when we attempt to put uniform distributions on infinite spaces, even in the
simplest possible case of the positive integers. The first way of proceeding
assumes that the notion of a uniform probability distribution on the natural
numbers makes sense, and that it obeys the same rules as an ordinary
probability distribution. Unfortunately, these two requirements are
incompatible. This is because ordinary probability distributions
are countably additive. We are all familiar with the fact
that probability adds across disjoint events: Pr(X= 0 or 1) =
Pr(X=0)+Pr(X=1). Moreover, we are all comfortable with the
idea that this holds for more than two events. The probability that X
first =1 by time 3 is the sum of the probability of the first 1 being at
time t=1, plus it being at t=2, plus it being at t=3.
Carrying this out to any finite collection of disjoint events is
called finite additivity. However, as I said, probability
measures are ordinarily required to be countably additive, meaning
that this holds even for a countable infinity of disjoint events.
And here we have trouble. The natural numbers are (by definition!)
countable, so the probability of all integers is the sum of the probability of
each integer,
Pr(T an integer) = sum(Pr(T=t))
The left-hand side must be 1. For a uniform distribution, we expect that
all the terms in the sum on the right-hand side must be equal, otherwise it's
not "uniform". But either all the terms are equal and positive, in which case
the right-hand side is infinite, or all the terms are equal and zero, in which
case the right-hand side is zero. Hence, there is no countably-additive
uniform probability measure on the integers, and the first approach, which
leads to the conclusion that Pr(X(T)=1)=0, is mathematically
incoherent.
Now, there are such things as finitely-additive probability measures, but
they are rather weird beasts. To specify one of them on the integers, for
example, it's not enough to give the probability of each integer (as it is for
a countably-additive measure); that only pins down the probability of finite
sets, and sets whose complements are finite. It does not, for example, specify
the probability of the even numbers. There turn out to be several different
ways of defining uniform distributions on the natural
numbers, which are not
equivalent. Under all of them, however, any finite set must have
probability zero, and so at a random time T, it is almost certain that
Pr(X(T)=1) is less than any real number you care to name.
Hence, the expectation value of this random probability is indeed zero.
(Notice, however, that if I try to calculate the expectation value
of any function f(t) by taking a probability-weighted
sum over values of t, as the first answer does, I will get the answer 0
when T follows a uniform finitely-additive measure, even
if f(t)=1 for all t. The weighted-sum-of-arguments
definition of expectation — the one reminiscent of Riemann integrals
— does not work for these measures. Instead one must use a
Lebesgue-style
definition, where one takes a weighted sum of the values of the
function, the weights being the measures of the sets giving those values.
[More exactly, one partitions the range of f and takes the limit as the
partition becomes finer and finer.] The equivalence of the summing over the
domain and summing over the range turns on, precisely, countable additivity.
The argument in the previous paragraph shows that here this expectation value
must be less than any positive number, yet not negative, hence zero.)
Finitely-additive probability measures
are profoundly
weird beasts, though some of my colleagues have what I can only consider
a perverse
affection for them. On the other hand, attempts to construct a natural
countably-additive analog of a uniform distribution on infinite sets have
been universally
unsuccessful;
this very much
includes the maximum entropy approach. The project of giving ignorance a
unique representation as a probability measure is, IMSAO, a failure. If one
picks some countably-additive prior distribution over the integers,
however, then at least one value of t must have strictly positive
probability, and the expectation value of Pr(X(T)=1) is positive,
though how big it is will depend on the prior distribution.
(As usual, the role of a Bayesian prior distribution is
to introduce bias so as to reduce variance.) Alternately, one simply follows
the second line of reasoning and concludes that, no matter what t might
be, the probability is positive.
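A small numerical sketch of the contrast, with everything beyond the 2^(-t) rule assumed for illustration: a geometric prior Pr(T=t) = (1-r) r^(t-1) on t = 1, 2, ... stands in for "some countably-additive prior", and the function names are hypothetical. Under that prior the expectation of Pr(X(T)=1) works out to (1-r)/(2-r), which is positive and depends on r. The truncated "uniform on 1..N" average from the first answer just shrinks towards zero as N grows.

```python
def expected_prob_geometric(r, terms=200):
    """E[2^(-T)] when Pr(T=t) = (1-r) * r**(t-1), t = 1, 2, ...

    The series is truncated at `terms`; the neglected tail is
    negligible for the values of r used below.
    """
    return sum((1 - r) * r ** (t - 1) * 2.0 ** (-t) for t in range(1, terms + 1))


def expected_prob_uniform(n):
    """Average of 2^(-t) over t = 1..n with equal weights 1/n."""
    return sum(2.0 ** (-t) for t in range(1, n + 1)) / n


# Proper priors: positive answers that depend on the prior.
for r in (0.5, 0.9, 0.99):
    print("geometric prior, r =", r, "->", expected_prob_geometric(r, terms=5000))

# Truncated "uniform" weights: the answer drifts to zero as n grows.
for n in (10, 100, 1000):
    print("uniform on 1..", n, "->", expected_prob_uniform(n))
```

The geometric family is only one convenient choice. Any other proper prior on the positive integers would also give a positive expectation, just a different one.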
Enigmas of Chance
Posted by crshalizi at December 08, 2009 10:55 | permanent link
December 05, 2009
36-350, Data Mining: Course Materials (Fall 2009)
My lesson-plan having survived first contact with
the enemy students, it's time to start posting the lecture
handouts & c. This page will be updated as the semester goes on; the RSS
feed for it should be here.
The class homepage has more
information.
- Introduction
to the course (24 August) What is data mining? how is it used? where did it
come from? Some themes.
- Information
retrieval and similarity searching I (26 August) Finding the data you are
looking for. Ideas we will avoid: meta-data and cataloging; meanings. Textual
features. The bag-of-words representation; its vector form. Measuring
similarity and distance for vectors. Example with the New York Times
Annotated Corpus.
- IR continued (28 August). The
trick to searching: queries are documents. Search evaluation: precision,
recall, precision-recall curves; error rates. Classification: nearest
neighbors and prototypes; classifier evaluation by mis-classification rate and
by confusion matrices. Inverse document frequency weighting. Visualizing
high-dimensional data by multi-dimensional scaling. Miscellaneous topics:
stemming, incorporating user feedback.
Homework 1, due 4 September: assignment,
R, data; solutions
- Page
Rank (31 August). Links as pre-existing feedback. How to exploit link
information? The random walk on the graph; using the ergodic theorem.
Eigenvector formulation of page-rank. Combining page-rank with textual
features. Other applications. Further reading on information retrieval.
- Image
Search, Abstraction and Invariance (2 September). Similarity search for
images. Back to representation design. The advantages of abstraction:
simplification, recycling. The bag-of-colors representation. Examples.
Invariants. Searching for images by searching text. An example in practice.
Slides for this lecture.
- Information
Theory I (4 September). Good features help us guess what we can't
represent. Good features discriminate between different values of unobserved
variables. Quantifying uncertainty with entropy. Quantifying reduction in
uncertainty/ discrimination with mutual information. Ranking features based on
mutual information. Examples, with code, of informative words for
the Times. Code.
Supplementary reading: David
P. Feldman, Brief Tutorial on
Information Theory, chapter 1
Homework 2, due 11 September: assignment; solutions
text
and R code
- Information Theory II (9
September). Dealing with multiple features. Joint entropy, the chain rule for
entropy. Information in multiple features. Conditional information, chain
rule for information, conditional independence. Interactions, positive and
negative, and redundancy. Greedy feature selection with low redundancy.
Example, with code, of selecting words for the Times. Sufficient
statistics and the information
bottleneck. Code.
Supplementary reading: Aleks Jakulin and Ivan Bratko, "Quantifying and
Visualizing Attribute
Interactions", arxiv:cs.AI/0308002
- Categorization;
Clustering I (11 September). Dividing the world up into categories.
Classification: known categories with labeled examples. Taxonomy of learning
problems (supervised, unsupervised, semi-supervised, feedback, ...).
Clustering: discovering unknown categories from unlabeled data. Benefits of
clustering, with a digression on where official classes come from. Basic
criterion for good clusters: lots of information about features from little
information about cluster. Practical considerations: compactness, separation,
parsimony, balance. Doubts about parsimony and balance. The k-means
clustering algorithm, or unlabeled prototype classification: analysis,
geometry, search. Appendix: geometric aspects of the prototype and
nearest-neighbor method.
Homework 3, due 18
September: assignment; solutions
- Clustering II (14 September).
Distances between partitions; variation-of-information distance.
Hierarchical clustering by agglomeration and its varieties. Picking the
number of clusters by merging costs. Performance of different clustering
methods on various doodles. Why we would like to pick the number of clusters
by predictive performance, and why it is hard to do at this stage. Reifying clusters.
- Transformations: Rescaling and
Low-Dimensional Summaries (16 September). Improving on our original
features. Re-scaling, standardization, taking logs, etc., of individual
features. Forcing things to be Gaussian considered harmful. Low-dimensional
summaries by combining features. Exploiting geometry to eliminate redundancy.
Projections on to linear subspaces. Searching for structure-preserving
projections.
- Principal Components I (18
September). Principal components are the directions of maximum variance.
Derivation of principal components as the best approximation to the data in a
linear subspace. Equivalence to variance maximization. Avoiding explicit
optimization by finding eigenvalues and eigenvectors of the covariance matrix.
Example of principal components with cars; how to tell a sports car from a
minivan. The standard recipe for doing PCA. Cautions in interpreting
PCA. Data-set used in the notes.
Homework 4, due 25
September: assignment; solutions
- Principal
Components II (21 September). PCA + information retrieval = latent
semantic indexing; why LSI is a Good Idea. PCA and multidimensional scaling.
- Factor
Analysis (23 and 25 September). From PCA to factor analysis by adding
noise. Roots of factor analysis in causal discovery: Spearman's general factor
model and the tetrad equations. Problems with estimating factor models: number
of equations does not equal number of unknowns. Solution 1, "principal
factors", a.k.a. estimation through heroic feats of linear algebra. Solution
2, maximum likelihood, a.k.a. estimation through imposing distributional
assumptions. The rotation problem: the factor model is
unidentifiable; the number of factors may be meaningful, but the individual
factors are not.
- The
Truth about PCA and Factor Analysis (28 September) PCA is data reduction
without any probabilistic assumptions about where the data came from. Picking
number of components. Faking predictions from PCA. Factor analysis makes
stronger, probabilistic assumptions, and delivers stronger, predictive
conclusions --- which could be wrong. Using probabilistic assumptions and/or
predictions to pick how many factors. Factor analysis as a first, toy
instance of a graphical causal model. The rotation problem once more with
feeling. Factor models and mixture models. Factor models and Thomson's
sampling model: an outstanding fit to a model with a few factors is actually
evidence of a huge number of badly measured latent variables.
Final advice: it all depends, but if you can only do one, try PCA.
R
code for the Thomson sampling model.
- Nonlinear
Dimensionality Reduction I: Locally Linear Embedding (5 October). Failure
of PCA and all other linear methods for nonlinear structures in data; spirals,
for example. Approximate success of linear methods on small parts of nonlinear
structures. Manifolds: smoothly curved surfaces embedded in higher-dimensional
Euclidean spaces. Every manifold looks like a linear subspace on a
sufficiently small scale, so we should be able to patch together many small
local linear approximations into a global manifold. Local linear embedding:
approximate each vector in the data as a weighted linear combination of
its k nearest neighbors, then find the low-dimensional vectors best
reconstructed by these weights. Solving the optimization problems by linear
algebra. Coding up LLE. A spiral
rainbow. R.
- Nonlinear
Dimensionality Reduction II: Diffusion Maps (9 October). Making a graph
from the data; random walks on this graph. The diffusion operator,
a.k.a. Laplacian. How the Laplacian encodes the shape of the data.
Eigenvectors of the Laplacian as coordinates. Connection to page-rank.
Advantages when data are not actually on a manifold. Example.
Pre-midterm review (12 October): highlights of the course to date; no
handout.
MIDTERM (14
October): exam, solutions
Homework 5, due 23 October:
assignment;
solutions
- Regression
I: Basics. Guessing a real-valued random variable; why expectation values
are mean-square optimal point forecasts. The regression function; why its
estimation must involve assumptions beyond the data. The bias-variance
decomposition and the bias-variance trade-off. First example of improving
prediction by introducing bias. Ordinary least squares linear regression
as smoothing. Other linear smoothers: k-nearest-neighbors and kernel
regression. How much should we
smooth? R, data
for running example
- Regression
II: The Truth About Linear Regression (21 October). Linear regression is
optimal linear (mean-square) prediction; we do this because we hope a linear
approximation will work well enough over a small range. What linear regression
does: decorrelate the input features, then correlate them separately with the
response and add up. The extreme weakness of the probabilistic assumptions
needed for this to make sense. Difficulties of linear regression;
collinearity, errors in variables, shifting distributions of inputs, omitted
variables. The usual extra probabilistic assumptions and their implications.
Why you should always look at residuals. Why you generally shouldn't use
regression for causal inference. How to torment angels. Likelihood-ratio
tests for restrictions of nice models.
- Regression III: Extending Linear
Regression (23 October). Weighted least squares. Heteroskedasticity:
variance is not the same everywhere. Going to consult the oracle. Weighted
least squares as a solution to heteroskedasticity. Nonparametric estimation of
the variance function. Local polynomial regression: local constants (= kernel
regression), local linear regression, higher-order local polynomials. Lowess =
locally-linear smoothing for scatter plots. The oracles fall silent.
Homework 6, due Friday, 30 October: assignment, data set; solutions
- Evaluating Predictive Models (26
and 28 October). In-sample, out-of-sample and generalization loss or error;
risk as expected loss on new data. Under-fitting, over-fitting, and examples
with polynomials. Methods of model selection and controlling over-fitting:
empirical risk minimization, penalization, constraints/sieves, formal learning
theory, cross-validation. Limits of
generalization. R for creating figures.
- Smoothing
Methods in Regression (30 October). How much smoothing should we do?
Approximation by local averaging. How much smoothing we should do to
find the unknown curve depends on how smooth the curve really is,
which is unknown. Adaptation as a partial substitute for actual knowledge.
Cross-validation for adapting to unknown smoothness. Application: testing
parametric regression models by comparing them to nonparametric fits. The
bootstrap principle. Why ever bother with parametric
regressions? R
code for some of the examples.
Homework 7, due Friday, 6
November: assignment;
solutions: text
and code
- Additive
Models (2 November). A nice feature of linear models: partial responses,
partial residuals, and backfitting estimations. Additive models: regression
curve is a sum of partial response functions; partial residuals and the
backfitting trick generalize. Parametric and non-parametric rates of
convergence. The curse of dimensionality for unstructured nonparametric
models. Additive models as a compromise, introducing bias to reduce variance.
Example with the data from homework 6.
- Classification
and Regression Trees (4 and 6 November). Prediction trees. A
classification tree we can believe in. Prediction trees combine simple local
models with recursive partitioning; adaptive nearest neighbors. Regression
trees: example; a little math; pruning by cross-validation; more R mechanics.
Classification trees: basics; measuring error by mis-classification; weighted
errors; likelihood; Neyman-Pearson classifiers. Uncertainty for trees.
Homework 8, due 5 pm on Monday, 16 November: assignment; solutions; R for solutions
- Combining Models 1: Bagging and Model Averaging (9 November)
- Combining Models 2: Diversity and Boosting (11 November)
- Linear Classifiers (16 November).
Geometry of linear classifiers. The perceptron algorithm for learning linear
classifiers. The idea of "margin".
- Logistic Regression (18
November). Attaching probabilities to linear classifiers: why would we want
to? why would we use the logistic transform to do so? More-than-binary
logistic regression. Maximizing the likelihood; Newton's method for
optimization. Generalized linear models and generalized additive models;
testing GLM specifications with GAMs.
- Support Vector Machines (20
November). Turning nonlinear problems into linear ones by expanding into
high-dimensional feature spaces. The dual representation of linear
classifiers: weight training points, not features. Observation: in the dual
representation, only inner products of vectors matter. The kernel trick:
kernel functions let us compute inner products in feature spaces without
computing the features. Some bounds on the generalization error of linear
classifiers based on "margin" and the number of training points with non-zero
weight ("support vectors"). Learning support vector machines by trading
in-sample performance against bounds on over-fitting.
Homework 9, due at 5 pm on Monday, 30 November: assignment
- Density Estimation (23 November).
Histograms as distribution estimates. Glivenko-Cantelli, "the fundamental
theorem of statistics". Histograms as density estimates; selecting density
estimates by cross-validation. Kernel density estimates. Why kernels are
better than histograms. Curse of dimensionality again. Hint at alternatives
to kernel density estimates.
- Mixture Models, Latent Variables and
the EM Algorithm (30 November). Compressing and restricting density
estimates. Mixtures of limited numbers of distributions. Mixture models as
probabilistic clustering; finally an answer to "how many clusters?" The EM
algorithm as an iterative way of maximizing likelihood with latent variables.
Analogy to k-means. More theory of the EM algorithm. Applications: density
mixtures, signal processing/state estimation, mixtures of regressions, mixtures
of experts; topic models and probabilistic latent semantic analysis. A glance
at non-parametric mixture models.
- Graphical Causal Models (2 December). Distinction between causation and
association, and between causal and probabilistic prediction. Some examples.
Directed acyclic graphs and causal models. The Markov property. Conditional
independence via separation. Faithfulness.
- Causal Inference (4 December).
Estimating causal effects; control for confounding. Discovering causal
structure: the SGS algorithm and its variants. Limitations.
Take-home final exam, due 15
December: assignment; data
sets: expressdb_cleaned
(20
Mb), HuIyer_TFKO_expression
(20 Mb). With great thanks to Dr. Timothy Danford.
Corrupting the Young;
Enigmas of Chance
Posted by crshalizi at December 05, 2009 14:39 | permanent link
November 30, 2009
Books to Read While the Algae Grow in Your Fur, November 2009
- Jen Van Meter, Christine Norrie and Chynna Clugston-Major, Hopeless Savages
- Incredibly sweet and charming; whether it's really punk rock I couldn't
say. (I completely forget where I saw this recommended, but thanks to whoever
it was.)
- Mike Mignola and Christopher Golden, Baltimore, or, the Steadfast
Tin Soldier and the Vampire
- Stories within stories, framed by the Great War unleashing not the
influenza pandemic of 1918, but a vampire-zombie apocalypse. Many, many nods
to prior horror fiction (most obviously Dracula, but also "The
Masque of the Red Death", etc.), and a lot of folkloric elements used to nicely
creepy effect. (But isn't "Mircea" a masculine name?) Mignola's drawings are
decorative and atmospheric, but not integral.
- Cat Rambo and Jeff VanderMeer, The Surgeon's Tale, and Other Stories
- The highlight is the title story, which occupies about half this little
book, and breathes new life — you should forgive the expression —
into the ancient trope of the Resurrection Gone Awry. Of the rest, Rambo's
"The Dead Girl's Wedding March" and "A Key Decides Its Destiny" are the best,
followed by VanderMeer's "The Farmer's Cat". About the last story, an extended
joke about a Lovecraftian menu, the kindest thing I can say is that the authors
must've had fun writing it.
- F. T. Marinetti, The Untameables
- Not actually recommended, unless you want a
violent Futurist words-in-liberty
fantasy full of orientalism, racism, and (most poisonously) formulaic
decadence. On this evidence, Marinetti was much better at writing manifestoes
(and cookbooks)
than fiction. — I have had this on my shelf since, so help me, 1994,
when I first started reading about Futurism; I should've gotten rid of it long
ago.
- Jason Aaron, R. M. Guéra, Davide Furnò and Francesco Francavilla, Scalped, vol. 5: High Lonesome
- Noir blacker than coal-dust. Earlier installments: 1, 2--4.
- Phil and Kaja Foglio, Agatha Heterodyne
and the: Golden Trilobite; Voice of the Castle;
Chapel of Bones
- Vols. 6--8 of Girl Genius; in which the lost heir reclaims the
ancestral castle, through the power of Science! (As well as perfecting
the coffee-maker.)
- Lev Vygotsky, Mind in Society: Development of Higher
Psychological Processes
- A fairly clear and cohesive statement of Vygotsky's key ideas, which were a
species of pre-cognitive
Marxist psychology. Here he is concerned with looking at what happens to
children's cognitive development when they bring together their practical
abilities to manipulate their bodies and tools, with their communicative
abilities to use words (and other signs) — specifically their learning to
use speech to guide behavior, especially their own behavior. (This
is very much about the unity of theory and praxis; but Dewey said
similar things, from a background of American pragmatism. [Then again,
Marx himself was pretty pragmatist already in 1845.]
Vygotsky mentions Dewey once here, without much understanding.) Specifically,
Vygotsky claims that speech and discursive thought come to guide behavior
through children learning to
talk to themselves about what they need to do to solve practical
tasks, which they come to from previously learning to talk to others
about what to do, or trying to get others to do it for them.
- This sets up three big themes of Vygotsky's. First, he thinks that all of
the characteristically human ("higher") mental processes originate as social
interactions, which we then learn to internalize and carry out independently.
The Marxist themes (especially out of Engels) here are obvious; he does not,
needless to say, demonstrate his contention, and in any case seems to overlook
the point that an organism needs a lot of specialized structure and capacity to
engage in those social interactions in the first place, let alone internalize
them! But he deserves, I think, considerable credit for raising the problem,
and the related one of how we use tools and signs in our environment to extend
our own cognition. (Here some of the experiments he reports on what's needed
for children to make effective use of memory aids are fascinating; but this
approaches stigmergy
rather than social labor.) Secondly, he emphasizes, when assessing children's
development, that looking at what they can do on their own is just picking out
(at best) what they have finished learning. If instead one assesses
what they can do "with assistance, collaboratively or in groups", what he calls
the "zone of proximal development", one gets a sense of what they are
learning and could learn. One might argue, though he doesn't
pursue this, that even in adults, activities like scientific investigation
never really leave the zone of proximal
development, especially not at the highest levels of accomplishment.
(Skilled scientists can do their students' homework problems on their own, but
not their research.) Thirdly, he is emphatic that if you want to investigate
cognitive development, you need to probe the
developmental process, and not just its end-products in over-learned
habits or polished skills. He suggests that the ideal experiments would
actually evoke cognitive development under laboratory conditions, and
that his students carried out such experiments; he does not provide enough
details to assess such a claim.
- The book concludes with some chapters on play, imagination and
make-believe, and on the "pre-history" of writing (i.e., things which come
before writing in individual development but have some kinship to it).
- Despite the way I've written, this isn't actually a book Vygotsky planned
and wrote out. It was assembled by its translators out of distinct Russian
manuscripts and preliminary translations provided
by A. R. Luria, and then considerably revised by the
translators. (I presume this is where anachronisms like "World War I" came
from.) How much of the result is really due to Vygotsky, or to
Vygotsky-and-Luria, and how much to the Americans, I can't say. The latter do
however provide an introduction, a brief biographical note and an extensive
afterword; all of these are probably most useful to readers previously
unacquainted with Vygotsky or his school.
- Michael
Bérubé, The Left at War
- The first half of this is Bérubé arguing with (as he says)
the "Manichean Left" over Afghanistan, Iraq, the Balkan wars of the 1990s, and
their general orientation and understanding of how the capitalist democracies
work, or don't. I find myself in complete agreement with this, including
Bérubé's positive vision, and unable to add anything of value.
(Read it.)
- The second half is an argument about the theory of ideology, the notion of
hegemony, and the not-exactly-a-discipline of cultural studies. More
specifically, it's a plea to get beyond the dualism of "everything with any
shred of mass appeal is a tool of the System" on the one hand, and pretending
that fans reading their own meaning into music videos (or whatever) has
anything to do with smashing said system, on the other. His plea is for a more
nuanced approach to ideology, which recognizes that the leading ideas of any
society are never all of a block, that political power always comes from
coalitions bringing together many divergent interests and ideas, etc., etc. He
is particularly fond of a version of these ideas articulated by Stuart Hall,
which seem, to judge by his quotations, quite reasonable but not at all
special, unless one is starting out from the most benighted precincts
of Marxism (e.g.,
that old
fraud Althusser). Linked to this, Bérubé is quite strenuous
about the importance of issues other than economic justice for any left that
wants to be serious about making sure that everyone isn't just formally free,
but can actually use and enjoy their freedom.
- Again, I find it hard to disagree with most of this; I just fail to see
what the two halves of the book have to do with each other. The best argument
I can reconstruct on his behalf would be something like this: lots of people on
the left have drifted, or been pushed, into Manichean positions because it
seems to follow from the way they understand ideology. If a better account had
been widely disseminated, fewer of them would have pursued that dead end. Some
parts of cultural studies had articulated that better account, but they failed
to make themselves heard; thus, if only more attention had been paid to debates
about how best to make use of Gramsci to understand British politics in the
early 1980s, the left would not have backed itself into such a corner in
2001--2003.
- I find myself drifting into snark in that last sentence, which is unfair.
I agree that the very crude counter-cultural thinking and functionalism which
seem very common on the left are Not Helpful. I'm just skeptical that (1)
giving more weight to cultural studies would have made this better, and that
(2) the particular cultural studies sub-tradition Bérubé points
to is really the best available way of thinking about these matters. I
have my own favorite candidate,
but mostly it's that what he takes from this sub-tradition just doesn't seem that
distinctive, except for its Marxist background. (For instance,
compare what Bérubé, following Hall, says about first asking
what's right or true about an ideology with
what Boudon
said in his excellent book
on The
Analysis of Ideology [= L'origine des idées
reçues]. And no, I'm not trying to play "My esoteric French
theorist trumps your merely-obscure British theorist"; Boudon is a distinguished
sociologist who just happens to have made this the center of his theory of
ideology.) Which is not to say that we shouldn't work to develop and
disseminate better ideas about these matters!
- Bérubé tends to avoid advancing his case directly, but rather
to get his point across by discussing some other writer's work, or in some
cases some other writer's discussion of a third author, etc. (I suspect this
is a professional deformation of literary critics.) This is a manner of
writing which drives some people crazy, and one which I find very tiresome in
other hands, but he pulls it off. Assuming this won't put you off, I recommend
this very strongly if you have any interest at all in progressive
politics.
- Disclaimer: Bérubé blogs at Crooked Timber, where
I've guest-blogged, etc., but we've never met or corresponded, and I have no
stake in the success of his book.
- Manual trackback: Michael Bérubé.
- Rick Geary, Trotsky: A Graphic Biography
- Well-told and well-drawn — though nothing in the rest of the art
matches the level of the hero/monster panels of the opening pages (and
cover!).
- Sarah
Graves,
Unhinged;
Mallets
Aforethought; Tool and Die; Nail Biter;
Trap
Door
- Honestly, I'm a bit surprised series fatigue hasn't set in yet; but I
continue to enjoy these. (And Eastport
continues to suffer levels of violent
death comparable to post-invasion Iraq. The fact that this homicide rate has
not yet attracted official attention suggests that the serial killer is Bob
Arnold, the police chief, rather than Jake Tiptree or Ellie White.)
Previous installments:
1--4,
5; sequel: 11.
- Jonathan
Israel, A
Revolution of the Mind: Radical Enlightenment and the Intellectual Origins of
Modern Democracy
- The story, or part of the story, of how the outlandish and unprecedented
ideology of a network of radical, subversive scribblers became what we all at
least pay lip-service to. Really deserves a detailed discussion; I'll just say
that there's a lot of fascinating material in here, but also many places where
I felt he didn't really prove his point, even or especially when I was very
sympathetic to what he was saying.
- Stephen Budiansky, The Bloody Shirt: Terror After the Civil War
- Once upon a time, the US Army attempted to bring democracy to a backward
part of the world which had long been wracked by ethnic conflict. There were
some promising beginnings, but the defeated, formerly dominant faction refused
to accept their relative demotion, and engaged in a vicious,
well-organized campaign of terrorism, which ultimately proved to
be entirely successful. Those who had trusted enough in the power and
benevolence of the United States to participate in the governments
ultimately overthrown by "violence and fraud" (in the words of one of the
over-throwers) were lucky to escape with their lives (as many did not).
Minimal democratic norms were not re-established for ninety years or more.
- This is of course the story of the failed Reconstruction of the South after
the Civil War, which Budiansky tells by recounting the inter-cut, and
occasionally overlapping, lives of a number of individuals on the
Reconstruction side of the conflict. One of his more effective tactics is to
quote extensively from their letters and journals, as well as from contemporary
books and newspapers. Caveat lector: many of these —
especially the newspapers — are full of vicious racist bile, as well as
the astonishing lies elite white Southerners told to portray themselves as
oppressed victims. (This begins with the story of the "bloody shirt" that
opens the book.) This stuff was hard for me to stomach, and might be too much
for some.
- My biggest complaint with the book is that I wish Budiansky had done more
to tell the stories of black Americans, the way he did with his white subjects
— not that there are none, I hasten to add. I can guess at reasons why
it would be harder to find materials (all of them ultimately having to do with
the fact that Southern blacks were an oppressed people who emerged from slavery
for a few years before being crushed back down to serfdom), but still... That
said, Budiansky's story of crushed hopes, futile bravery and murderous hatred
is wonderfully written and incredibly depressing. I hope that it fills many
Americans with the sort of patriotic shame which helps us be better.
- Luc Devroye and Gabor Lugosi, Combinatorial Methods in Density
Estimation
- The fundamental theorem of statistics,
says Pitman, is
the Glivenko-Cantelli
theorem: the empirical distribution function $F_n$ of a
large sample of independent, identically-distributed random variables comes
arbitrarily close to their true distribution function $F$: as $n$
goes to infinity, $\max_x |F_n(x) - F(x)|$ goes to 0 almost surely. This means that we can
learn any underlying probability distribution to arbitrary accuracy just by
collecting enough data.* Unfortunately the empirical distribution function is
always discrete, so it doesn't have a density, even if the underlying
distribution does. Or, if you like, it has a density, but it's a mixture of
Dirac delta functions. (The convergence is in the sense of "weak convergence"
or "convergence in distribution".) Density estimation is basically about
taking the empirical distribution function and smoothing it so that it has a
well-behaved density. The oldest way of doing this is to build a histogram,
which gives constant densities to intervals; other methods include fitting
function series (Fourier or wavelet expansions) to the data, or using kernels
(replacing each of the delta function spikes with a smooth density, say a
Gaussian bell-shaped curve). The art here is to pick the manner of smoothing,
and the amount of smoothing, so that (1) the convergence promised by
Glivenko-Cantelli for the unsmoothed distribution is not just maintained but is
(2) strengthened to convergence of the estimated density on the true density,
and ideally (3) the latter convergence happens rapidly.
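To make the last two points concrete, here is a minimal sketch in Python (my choice for illustration; nothing here comes from the book). It draws standard Gaussian samples, measures the largest gap between the empirical distribution function and the true one, and builds a Gaussian kernel density estimate. The function names, the sample sizes and the bandwidth are all assumptions made up for this example.

```python
import math
import random

def normal_cdf(x):
    """Distribution function of a standard Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance(sample, cdf):
    """max_x |F_n(x) - F(x)|, checked just before and at each jump of F_n."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        worst = max(worst, abs((i + 1) / n - f), abs(i / n - f))
    return worst

def gaussian_kde(sample, bandwidth):
    """Replace each delta spike with a Gaussian bump of width `bandwidth`."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in sample)
    return density

rng = random.Random(0)
for n in (100, 1000, 10000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, round(sup_distance(sample, normal_cdf), 4))

# A smooth density estimate from a small sample (bandwidth picked by hand).
kde = gaussian_kde([rng.gauss(0.0, 1.0) for _ in range(200)], 0.4)
print("estimated density at 0:", round(kde(0.0), 3))
```

The printed gaps should shrink as n grows, which is Glivenko-Cantelli at work; how well the kernel estimate does depends heavily on the hand-picked bandwidth.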
- Devroye and Lugosi's book is devoted to establishing conditions under which
common density estimators have these three desirable properties (or, more
rarely, when they do not). Throughout, they focus on the "total variation"
or $L_1$ distance between
densities: $d_{TV}(f,g)$ is the integral of
$|f(x) - g(x)|$ over all $x$. They
mention, but generally avoid, other common distances or pseudo-distances such
as $L_2$ (integral of $|f(x) - g(x)|^2$), Hellinger distance (too ugly to write
in HTML), or relative entropy (Kullback-Leibler divergence, expected
log-likelihood ratio). The total variation distance has a very natural
probabilistic interpretation (the maximum amount by which the estimated
probability of any event differs from its true probability), and they can get
very nice finite-sample bounds by minimizing it over various classes of
possible estimates, so this choice is eminently defensible; it does however cut
them off from using a lot of existing theory. (For instance, the optimal
coefficients in a Fourier series, from an $L_1$ point of view,
are not just the empirical Fourier coefficients, since the latter
are $L_2$ optimal.)
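A small numerical check of these distances may help; this is a sketch under my own assumptions, not anything from the book. It takes two unit-variance Gaussian densities with means 0 and 1 (an arbitrary choice), integrates on a grid with the midpoint rule, and reports the integral of the absolute difference, the integral of the squared difference, and the largest gap in the probability of any single event. With the integral convention used above, that largest gap comes out to half the integral (Scheffé's identity), attained on the event where the first density exceeds the second.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Gaussian density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def distances(f, g, lo=-10.0, hi=11.0, step=0.001):
    """Midpoint-rule approximations on [lo, hi].

    Returns (integral of |f-g|, integral of (f-g)^2, integral of max(f-g, 0)).
    The last is the probability gap on the event {x : f(x) > g(x)}.
    """
    n = int(round((hi - lo) / step))
    l1 = l2 = pos = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * step
        d = f(x) - g(x)
        l1 += abs(d) * step
        l2 += d * d * step
        pos += max(d, 0.0) * step
    return l1, l2, pos

f = lambda x: normal_pdf(x, 0.0)   # hypothetical "true" density
g = lambda x: normal_pdf(x, 1.0)   # hypothetical "estimate"
l1, l2, max_gap = distances(f, g)
print("integral of |f - g|     :", round(l1, 4))
print("integral of (f - g)^2   :", round(l2, 4))
print("largest event-prob. gap :", round(max_gap, 4))
```

The grid limits and step size are arbitrary; for densities with heavier tails they would need widening.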
- Their general goal is to prove finite-sample upper bounds on
the $L_1$ error of their density estimates; if these go to
zero as n grows, we get (1) and (2) above, and the rate of convergence
tells us how close we are to obtaining (3). Their route to this goal is almost
always through VC theory, and
empirical process
theory more generally. As always, this has two parts: one is deviation
inequalities
(e.g., Hoeffding's)
which bound the probability that any one candidate density will look much
better in sample than it will look out of sample. The other part is
combinatorial arguments that the behavior of an entire space of functions can
be approximated by that of a finite number of key functions. Meshed together
by a union bound, these
give uniform concentration bounds, with rates of convergence depending on the
complexity of the combinatorial construction needed to achieve a given degree
of approximation (i.e., the VC dimension). Devroye and Lugosi's key theorems
bound the error of their density estimates in terms of the VC dimensions of the
sets formed by comparing two densities in the class. (Specifically, they are
interested in the sets where one estimate is higher than another by a given
amount; this is, as they note, extremely similar to the threshold procedure
used to apply VC theory to regression problems.) Finite VC dimension for such
sets implies convergence to within a constant factor of the best available
approximation to the true density. They extend such results to ones where the
amount of smoothing is determined by data-set splitting, i.e., dividing the
data into a training and a testing set, and picking the degree of smoothing
which best generalizes from the training set to the testing set. (They do not
consider other forms of cross-validation, which is a shame because they're a
lot more common than simple data-splitting, but understandable because they're
very ugly to analyze.) They give a lot of attention to kernel density
estimates, including bounds for continuous kernels in terms of how hard it is
to approximate them by simple step-functions, for which the combinatorics are
easy.
- Strictly speaking, the book presupposes measure-theoretic probability, but
readers uncomfortable with sigma-fields and Radon-Nikodym derivatives
could mostly get away with ignoring the former and reading
"probability density functions" for the latter. Similarly, the actual
combinatorics are either elementary, or can be taken on trust. This book is
probably not the best way to first encounter density estimation — I
suspect a less theoretical introduction would not only make the ideas clearer,
but also make readers want theoretical guidance — but no
experience on that score is, strictly, necessary. Neither, really, is prior
knowledge of learning theory or VC theory, though again it would probably help.
The ideal situation for the book is, I'd guess, a second-year graduate-level
course on density estimation (there are many excellent problems), or
self-study.
- *: Well, we have to pretend the data are IID, but
let that slide. Or: assume sufficiently rapid strong mixing and argue, as
in Vidyasagar, that VC results then
hold with tolerable corrections. Kernel density estimates for stochastic
processes are treated at length in
Bosq's Nonparametric
Statistics for Stochastic Processes: Estimation and Prediction, but
the starting point there is ergodic theory, not learning theory.
- George Clark, Science and Social Welfare in the Age of Newton
- Connections between the scientific revolution, economic development and
economic policy (such as it was) in late-17th and early-18th century England,
and to a lesser extent France and the Netherlands. Interesting stuff on the
connections between the activities of scientists and technological development,
including the shrewd observation, contra Marxists claiming that
scientific progress was basically directed to solving the capitalists'
problems, that there were plenty of lucrative problems where scientists got
nowhere, or didn't even try to get anywhere, because it was just
not scientifically feasible. Also some interesting material on the
early history of statistics. The first edition was published in 1937, and
shows both that it was written during the Depression, and that respectable
economists had no idea what was going on. (This does not much harm the
book.)
Books to Read While the
Algae Grow in Your Fur;
Scientifiction and Fantastica;
Pleasures of Detection, Portraits of Crime;
The Progressive Forces;
The Great Transformation;
The Beloved Republic;
Minds, Brains, and Neurons;
Enigmas of Chance;
The Continuing Crises;
Cthulhiana
Posted by crshalizi at November 30, 2009 23:59 | permanent link
November 24, 2009
(I Don't Give a Damn About My) Bad Reputation
Attention conservation notice: Unskillful nattering about
pop-culture ephemera.
For the sake of my own sanity, I prefer to remain ignorant of the occult
processes by which the direct mail gods decide which catalogues to send to
which people.
(There's too
much dynamic programming involved.) Today, for instance, they decided to
inflict upon me the official Barbie doll spring 2010 collection preview, and
like a fool I couldn't resist looking through it. Thus my life is made that
much worse by learning that there is
a Joan
Jett Barbie doll. (I thought about embedding an image, but in this case
pain shared is not pain eased.) I think I finally grasp what
people mean when they talk about later cultural products
assaulting parts of their
childhood, in this case one I didn't even realize I valued.
Manual trackback: Mostly Harmless
Linkage
Posted by crshalizi at November 24, 2009 19:27 | permanent link
November 21, 2009
"Homophily, Contagion, Confounding: Pick Any Three"
A number of people have asked for my slides from
the MERSIH conference
the other week. So,
here they are.
(Anyone who was at my talk at SFI about a year ago will recognize the title,
and much of the content.) I'm presently turning this into a proper manuscript,
so comments are welcome. Please don't rip it off; I'll become very cross and
may even hold my breath until I turn blue and pass out, and won't you be sorry
then?
Manual
trackback: Cognition
and Culture
Networks;
Enigmas of Chance;
Complexity;
Self-Centered
Posted by crshalizi at November 21, 2009 18:32 | permanent link
November 19, 2009
"Statistical Analysis of Stellar Evolution" (Next Week at the Statistics Seminar)
In which the starry heavens above submit to statistical analysis:
- David van Dyk, "Statistical
Analysis of Stellar Evolution"
- Abstract: Color-Magnitude Diagrams (CMDs) are plots that compare
the magnitudes (luminosities) of stars in different wavelengths of light
(colors). High non-linear correlations among the mass, color and surface
temperature of newly formed stars induce a long narrow curved point cloud in a
CMD known as the main sequence. Aging stars form new CMD groups of red giants
and white dwarfs. The physical processes that govern this evolution can be
described with mathematical models and explored using complex computer
models. These calculations are designed to predict the plotted magnitudes as a
function of parameters of scientific interest such as stellar age, mass, and
metallicity. Here, we describe how we use the computer models as a component of
a complex likelihood function in a Bayesian analysis that requires
sophisticated computing, corrects for contamination of the data by field stars,
accounts for complications caused by unresolved binary-star systems, and aims
to compare competing physics-based computer models of stellar evolution.
- This is joint work with Steven DeGennaro, Nathan Stein, William
H. Jefferys, Ted von Hippel, and Elizabeth Jeffery.
- Place and time: Doherty Hall A310, Monday, 23 November, 4--5 pm.
Enigmas of Chance;
The Eternal Silence of These Infinite Spaces;
Physics
Posted by crshalizi at November 19, 2009 12:02 | permanent link
November 13, 2009
"Some Things Statisticians Do at Google" (Next Week at the Statistics Seminar)
Attention conservation notice: Of no use to you unless
(1) you want to know what statisticians do at search-engine companies
and (2) you are in Pittsburgh.
- Mike Meyer, "Some Things Statisticians Do at Google"
- Abstract: I'll talk about a number of projects at Google where statisticians
have made a large contribution. There will not be a lot of technical
details. In some cases I will just describe the problem.
- The major example will be a description of the statistical and
engineering infrastructure to support live traffic experiments
at Google.
- A common theme of the problems is the importance of understanding
basic statistical principles that can be applied and modified to
handle new data and new circumstances.
- Place and time: Monday, 16 November at 4 pm, in Doherty Hall
A310
As always, the talk is free and open to the public.
Enigmas of Chance
Posted by crshalizi at November 13, 2009 15:09 | permanent link
November 08, 2009
The Shadow Price of Power
Attention conservation notice: Quasi-teaching note giving
an economic interpretation of the Neyman-Pearson lemma on statistical
hypothesis testing.
Suppose we want to pick out some sort of signal from a background of noise.
As every schoolchild knows, any procedure for doing this,
or test, divides the data space into two parts, the one where
it says "noise" and the one where it says "signal".* Tests will make two kinds
of mistakes: they can take noise to be signal, a false
alarm, or can ignore a genuine signal as noise,
a miss. Both the signal and the noise are stochastic, or we
can treat them as such anyway. (Any determinism distinguishable from chance is
just insufficiently complicated.) We want tests where
the probabilities of both types of errors are small. The probability
of a false alarm is called the size of the test; it is the
measure of the "say 'signal'" region under the noise distribution. The
probability of a miss, as opposed to a false alarm, has no short name in the
jargon, but one minus the probability of a miss — the probability of
detecting a signal when it's present — is called power.
Suppose we know the probability density of the noise p and that of
the signal is q. The Neyman-Pearson lemma, as many though not all
schoolchildren know, says that then, among all tests of a given size s,
the one with the smallest miss probability, or highest power, has the form "say
'signal' if q(x)/p(x) > t(s),
otherwise say 'noise'," and that the threshold t varies inversely
with s. The quantity q(x)/p(x) is
the likelihood ratio; the Neyman-Pearson lemma says that to
maximize power, we should say "signal" if it's sufficiently more likely
than noise.
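To make the lemma concrete, here is a minimal sketch under an assumed toy setup that is not in the argument above: unit-variance Gaussian noise centered at 0 and a unit-variance Gaussian signal centered at 1. The function name and the particular sizes are illustrative only. In this case the likelihood ratio is increasing in x, so the most powerful test just thresholds x, and we can read off how t falls as s rises.

```python
from statistics import NormalDist

# Assumed toy setup (not from the post): noise p = N(0, 1), signal q = N(1, 1).
noise = NormalDist(mu=0.0, sigma=1.0)
signal = NormalDist(mu=1.0, sigma=1.0)

def neyman_pearson_test(size):
    """Return (cutoff, threshold t, power) of the most powerful test of this size."""
    # Here q(x)/p(x) = exp(x - 1/2) is increasing in x, so "ratio > t" is the
    # same region as "x > cutoff"; pick the cutoff whose noise tail mass is `size`.
    cutoff = noise.inv_cdf(1.0 - size)
    threshold = signal.pdf(cutoff) / noise.pdf(cutoff)  # t(s), the ratio at the boundary
    power = 1.0 - signal.cdf(cutoff)                    # Q(R): chance of catching a signal
    return cutoff, threshold, power

sizes = [0.01, 0.05, 0.10, 0.20]
results = [neyman_pearson_test(s) for s in sizes]
for s, (cutoff, t, power) in zip(sizes, results):
    print(f"size={s:.2f}  cutoff={cutoff:.3f}  t={t:.3f}  power={power:.3f}")
```

Allowing more false alarms lowers the threshold the likelihood ratio must clear and raises the power, which is the inverse relation between t and s.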
The likelihood ratio indicates how different the two distributions —
the two hypotheses — are at x, the data-point we
observed. It makes sense that the outcome of the hypothesis test should depend
on this sort of discrepancy between the hypotheses. But why
the ratio, rather than, say, the difference
q(x) - p(x), or a signed squared difference, etc.? Can we make this
intuitive?
Start with the fact that we have an optimization problem under a constraint.
Call the region where we proclaim "signal" R. We want to maximize its
probability when we are seeing a signal, Q(R), while constraining
the false-alarm probability, P(R) = s. Lagrange
tells us that the way to do this is to maximize Q(R) - t[P(R) - s]
over R and t jointly.
So far the usual story; the next turn is usually "as you remember from the
calculus of variations..."
Rather than actually doing math, let's think like economists. Picking the
set R gives us a certain benefit, in the form of the
power Q(R), and a cost, tP(R).
(The ts term is the same for all R.) Economists, of course, tell
us to equate marginal costs and benefits. What is the marginal
benefit of expanding R to include a small neighborhood around the point
x? Just, by the definition of "probability
density", q(x). The marginal cost is
likewise tp(x). We should include x in R
if q(x) > tp(x),
or q(x)/p(x) > t. The boundary of R
is where marginal benefit equals marginal cost, and that is why we need the
likelihood ratio and not the likelihood difference, or
anything else. (Except for a monotone transformation of the ratio, e.g. the
log ratio.) The likelihood ratio threshold t is, in fact, the
shadow price of
statistical power.
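The marginal argument can be checked numerically. What follows is a rough sketch, not a proof, and everything specific in it is an assumption made for illustration: the same toy Gaussians as a stand-in for p and q (noise centered at 0, signal at 1, unit variance), a grid of small cells in place of "small neighborhoods", and a hypothetical helper `greedy_region`. Each cell has a cost (its noise mass) and a benefit (its signal mass); we buy cells in order of some score until the false-alarm budget s is spent, once ranking by the ratio q/p and once by the difference q - p.

```python
from statistics import NormalDist

# Assumed toy setup (not from the post): noise p = N(0, 1), signal q = N(1, 1).
noise = NormalDist(mu=0.0, sigma=1.0)
signal = NormalDist(mu=1.0, sigma=1.0)

def greedy_region(score, size_budget, lo=-6.0, hi=7.0, n_cells=1300):
    """Add grid cells to R in decreasing order of score(q_mass, p_mass),
    stopping before the false-alarm mass would exceed size_budget.
    Returns (size, power) of the resulting region."""
    h = (hi - lo) / n_cells
    mids = [lo + (i + 0.5) * h for i in range(n_cells)]
    # Midpoint approximation to each cell's (cost, benefit) = (P mass, Q mass).
    cells = [(noise.pdf(x) * h, signal.pdf(x) * h) for x in mids]
    ordered = sorted(cells, key=lambda c: score(c[1], c[0]), reverse=True)
    size = power = 0.0
    for p_mass, q_mass in ordered:
        if size + p_mass > size_budget:
            break  # budget spent: this cell is the boundary of R
        size += p_mass
        power += q_mass
    return size, power

budget = 0.05
ratio_size, ratio_power = greedy_region(lambda q, p: q / p, budget)
diff_size, diff_power = greedy_region(lambda q, p: q - p, budget)
print(f"rank by q/p: size={ratio_size:.4f}  power={ratio_power:.4f}")
print(f"rank by q-p: size={diff_size:.4f}  power={diff_power:.4f}")
```

Both regions stay within the same false-alarm budget, but ranking by benefit per unit cost, i.e. by the ratio, buys noticeably more power than ranking by the difference, which spends the budget on a band of moderately large x and ignores the cheap far tail.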
I am pretty sure I have not seen or heard the Neyman-Pearson lemma explained
marginally before, but in retrospect it seems too simple to be new, so pointers
would be appreciated.
Manual trackback: John Barrdear
Updates: Thanks to David Kane for spotting a typo.
*: Yes, you could have a randomized test procedure,
but the situations where those actually help pretty much define "boring,
merely-technical complications."
Enigmas of Chance
Posted by crshalizi at November 08, 2009 03:06 | permanent link
November 04, 2009
Blosxom Fading in November
My old Blosxom installation (v. 2.0.2),
after several years of working nicely, is growing increasingly cranky, and
mulishly refusing to generate or update posts as the whim takes it. (I am not
sure how much kicking and shoving it will need to produce this.) I'd
appreciate a pointer to something which works similarly, but does
work: I write posts in plain HTML in Emacs and drop them in a directory; it
makes them look nice. If it handles tags and/or LaTeX nicely, so much the
better.
Self-Centered
Posted by crshalizi at November 04, 2009 19:34 | permanent link