April 30, 2009

Books to Read While the Algae Grow in Your Fur, April 2009

Jérôme Dedecker, Paul Doukhan, Gabriel Lang, José Rafael León R., Sana Louhichi and Clémentine Prieur, Weak Dependence: With Examples and Applications
As every school-child knows, a sequence of probability measures Dn converges "weakly" or "in distribution" on a limit D if Dnf converges on Df for every bounded and continuous test function f. In particular, if Dn is the joint distribution of two random variables from a time series separated by a time-lag n, with marginal distributions P and Q, then the series is asymptotically independent if Dn converges in distribution on the product measure P*Q.
(This is "weak" convergence because the Dn can look very different from the limiting D. A classic example: Dn puts probability 1/n on the points 1/n, 2/n, ... n/n = 1. This converges weakly to the uniform distribution on the unit interval [0,1], despite apparent obstacles like the latter giving probability 1 to the irrational numbers, while all the Dn give them probability 0.)
A large literature has grown up since the 1950s which concerns itself with properties of asymptotically-independent stochastic processes, usually measuring the approach to independence by means of various "mixing coefficients", along the lines introduced by Rosenblatt. As the name implies, if these coefficients go to zero asymptotically, then the process is strongly mixing. Having these mixing coefficients go to zero is very nice, since it implies that many of the nice properties of IID sequences (central limit theorems, PAC-learning results, etc.), carry over to the dependent-but-mixing sequences. Unfortunately, strong mixing is a very strong property, which is hard to prove and in many cases is known to fail — even when there is asymptotic independence!
This book, which summarizes and extends a series of recent papers by the authors (in various combinations) is about proving limit theorems — laws of large numbers, central limit theorems, empirical-process convergence, etc. — for asymptotically-independent but non-mixing processes, which the authors call "weakly dependent". Their theorems in chs. 6--10 can, in large part, be seen as instances of the trick that I learned as "blocking", and which they attribute to S. N. Bernstein. To calculate the time average of some function of a (stationary) process, for instance, divide the series up into a series of long blocks, plus padding or filler in between them. The global average is, trivially, equal to the average of the within-block averages, plus a remainder term for contributions from the filler. By stationarity, the blocks all have the same distribution, and by weak dependence the blocks are nearly independent — more nearly with longer fillers. So one can show that the global average will have the same behavior as it would in the IID case, if one can control (1) the corrections due to the remaining dependence among the blocks, and (2) the remainder due to the fillers between the blocks. Controlling (2) is fairly straightforward; the new stuff here comes from controlling (1), the residual dependence.
Following lines laid down in mixing theory, the authors tackle (1) by constructing deviation inequalities, bounds on the probability that sample values differ from expectations by more than a certain amount. Typically, these inequalities (chs. 4 and 5) have a more-or-less combinatorial flavor, though they also use coupling, and a lot of manipulation of cumulants (without making the connection to large deviations theory via cumulant generating functions). — As usual with such inequalities, it's rarely clear whether the various factors of 33 (or whatever) are for real, or just the best one can do with the current method of proof; but they suffice for many purposes, and I suppose it's better to have the constants be explicit than to bury them inside O() or o().
In mixing theory, the analogous bounds are stated in terms of the mixing coefficients. Here the new bounds are stated in terms of a range of new dependence measures (ch. 2), all denoted by utterly un-mnenomic Greek letters (again following the mixing-theory convention). Some of them play directly off the definition of convergence in distribution in terms of expectations of test functions: how big can the difference between Dnf and P*Qf get, when f is restricted to some suitably well-behaved, normalized class of test functions? The other set of measures are "projective". The conditional expectation of a function f, given X1, X2, etc., is the projection of f on to the space of functions calculable from X1, X2, etc.; the unconditional expectation of f is the projection of f on to the space of constant functions, which are calculable from nothing. (By "calculable from" I mean "measurable in the minimal sigma-algebra of".) If f is independent of X, then its conditional expectation is still its projection on to the constants. The other class of dependence coefficients, therefore, looks at how far conditional expectations depart from constancy — either maximizing over some family of test-functions, or by making future values of the time series itself the function in question.
Applications to non-parametric regression, spectral-density estimation, and various econometric problems occupy chs. 11--13, and ch. 3 shows that many standard and some non-standard time series models are weakly dependent, with estimates of the dependence coefficients.
— The book shows definite signs of its patch-up job origins. The prose is disjointed and slightly repetitive. The notation, too, is not always consistent; sometimes the generic dependence measure is c, sometimes it's \mu, and it can switch back and forth from one line to the next. A bigger annoyance is terminology. What the authors (and, it must be said, many probabilists) call "mixing" is more properly strong mixing, i.e., convergence of joint distributions to product measure in the strong, not the weak, topology. (If the Adversary can pick any pair of events, and you can show that that pair become independent as the separation between them grows, you have proved ordinary mixing. If the Adversary gets to pick a different pair of events with each separation, and you can still show asymptotic independence, that's strong mixing.) In dynamical systems and ergodic theory, "mixing" unmodified refers to the weaker notion of mixing. (See for instance the detailed treatment in Lasota and Mackey.) Thus for someone of my background, the Bernoulli shift is one of the canonical mixing processes, but they give it as their first example of a non-mixing process! (Their proof, incidentally, is wrong, but easily fixed.) In fact, I'd say that the failure to connect what they have done to the work in dynamics on decay of correlations (e.g.) is a real lost opportunity for both fields — optimistically, a direction for future research.
In the end, however, these are all minor complaints. The authors have brought together a great deal of original and important work, as a result of which the classical probabilistic limit theorems can now be rigorously applied to a much wider range of stochastic models. It's cool stuff and I would be very happy to teach it. I strongly recommend the book to statisticians interested in inference for stochastic processes, learning theorists interested in dependent data, probabilists interested in new applications, and theoretical econometricians.
Warren Ellis and Paul Duffield, Freakangels, vol. 2
Jack Campbell, Lost Fleet, vol. 5: Relentless
Brain-candy, in the process of moving from the Anabasis to De Bello Civili. (Which does not mean it's a historical pastiche.) Earlier morsels: 1--3, 4.
David Freidel, Linda Schele and Joy Parker, Maya Cosmos: Three Thousand Years on the Shaman's Path
A detailed and intelligent, if very conjectural, exploration of ancient Maya cosmology, based on deciphering surviving inscriptions, artifacts, and extrapolation from the modern Maya. Marred by irritating and silly stabs at cultural relativism (mostly at the beginning and the end).
Peter D. Harrison, The Lords of Tikal: Rulers of an Ancient Maya City
History of the greatest Maya city from the viewpoint of its rulers (the only view we really have any access to), from the beginning to the end. Clear on the difference between what we actually know and what the Mayanists are merely guessing at. Very good on architecture.
Bruce Kitchens and Selim Tuncel, Finitary Measures for Subshifts of Finite Type and Sofic Systems
As you know, Bob, a shift dynamical system has a state-space consisting of infinite sequences (one or two sided), say x drawn from a finite alphabet. The time-evolution operator T, or shift, simply moves the sequence one step to the right, i.e., (Tx)n = xn+1. This moves all the complexity and distinctions between different systems into the set of initial conditions, and space of allowed sequences, rather than in the time-evolution rule. The full shift has all possible sequences among its initial conditions (and so the space never shrinks); sub-shifts restrict the set of initial conditions. Suppose that the allowed values of xn depend on xn-1, xn-2, xn-k, but not on earlier entries in x, and that this k is the same for all n and all sequences. Then we have a subshift of finite type, the type or order of the subshift being k. These are the symbolic-dynamical equivalents of Markov chains of order k. As with Markov chains, any higher-order subshift of finite type can be converted to a first-order subshift. (Define a new alphabet where each symbol is a block of length k from the original alphabet.) A sofic system can be defined in one of two equivalent ways. One introduces the notion of a follower set, the set of all one-sided infinite sequences which can succeed a given history; a subshift is sofic if it has only finitely many follower sets. Alternately, introduce the notion of factor maps — continuous functions from one sequence space to another which commute with the shift. A sofic system is the image of a subshift of finite type under a factor map. A strictly sofic system is one which is sofic, but not a subshift of finite type.
Said another way, for any sofic system there is a set of states, possibly hidden, where the current state determines what symbols can be seen next, and the current state plus the next symbol determines the next state. Sofic systems are (the languages generated by) deterministic finite automata.
All of this is at the purely topological, possible-or-not level. One can of course also add probability measures on sequences. A Markov measure is one under which the distribution of Xn depends on Xn-1, but not, given that, on previous symbols. (Similarly for higher-order chains.) In this monograph, Kitchens and Tuncel ask, what measures are to Markov measures as sofic systems are to subshifts of finite type? They call these measures finitary, and characterize them in several ways, including (1) measures where the number of follower distributions is finite, (2) factor-map images of Markov measures, (3) ones where the conditional distribution of the future is always independent of the remote past given a finite segment of the immediate past, though one of possibly variable, context-dependent length. (The last is what motivates the name "finitary".) They develop the theory of finitary measures in considerable detail, and in parallel to the usual theory of Markov measures, as that is found in ergodic theory and the thermodynamic formalism. Statistical aspects (the reconstruction of finitary measures from sample data) are not addressed — fortunately for me. Prior acquaintance with symbolic dynamics and the thermodynamic formalism is absolutely essential, and a lot of familiarity with manipulating semi-groups would help.
(Your best bet for obtaining this is directly from the publisher. However, the community would be better served by putting it on the arxiv.)
Misty Massey, Mad Kestrel
Brain-candy. More piracy and less romance than blurb implies.
John McGowan, Democracy's Children: Intellectuals and the Rise of Cultural Politics
Nowhere near as cohesive as the subtitle suggests. Rather it's a set of essays on the conditions, activities and aspirations of academic in the humanities (especially literature as such) in the US from the mid-1980s through the mid-1990s, along with some thoughts about how they ought to try relating to the larger society they're part of. (I am quite sympathetic to the latter.) I wish McGowan would write a book about "intellectuals and the rise of cultural politics"; I think it would be very interesting.
(His mid-1990s remarks on the Web are more amusing in retrospect than they would have been at the time, since he's gone from sniffing at it to guest-blogging for Bérubé. To be fair, the idea that English departments would need specialists in hypertext does seem rather dated — because we have all become specialists in hypertext.)
C. J. Sansom, Revelation
Michelle Sagara, Cast in Fury

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Writing for Antiquity; Enigmas of Chance; Mathematics; The Progressive Forces; The Commonwealth of Letters

Posted by crshalizi at April 30, 2009 23:59 | permanent link

April 28, 2009

That Word Does Not Exist In Any Language

Attention conservation notice: 1500 words on a dispute at the intersection of paleography and computational linguistics, two areas in which I have no qualifications; also a sermon on a text from Lichtenberg: "We must not seek to abstract from the busts of the great Greeks and Romans rules for the visible form of genius as long as we cannot contrast them with Greek blockheads."

Over the weekend, I read Mark Liberman's post at Language Log about the new Rao et al. paper in Science, claiming to show information-theoretically that the symbols recovered on artifacts from the Harappan civilization in the Indus Valley are in fact a form of writing, as had long been supposed but was denied a few years ago by Farmer, Sproat and Witzel.

What Rao et al. claimed to show is that the sequences of Indus symbols possess information-theoretic properties which are distinctive of written language, as opposed to other symbols sequences, say ones which are completely random (IID, their "type 1 nonlinguistic") or completely deterministic (their "type 2 nonlinguistic"). Specifically, they examined the conditional entropy of sequence pairs (i.e., the entropy of the next symbol given the previous one). The claim is that the Indus symbols have the same pattern for their conditional entropy as writing systems, which is clearly distinguishable from non-linguistic symbol sequences by these means.

As someone who is very, very into information theory (especially conditional entropy), I was intrigued, but also very puzzled by Mark's account, from which it seemed that Rao et al. had a huge gap where the core of their paper should be. Actually reading the paper convinced me that Mark's account was correct, and there was a huge logical fallacy. I'll reproduce Figure 1A from the paper and explain what I mean.

Rao et al. worked with a corpus of Indus Valley inscriptions, which recognizes 417 distinct symbol types. This is their "Indus" line. The other language lines come from different corpora in the indicated language. In each corpus, they filtered out the less common symbols, and then fit a first-order Markov chain. (Transition probabilities were estimated with a smoothing estimator rather than straight maximum likelihood.) Then they calculated the conditional entropy of the chain, using the estimated transition probabilities and the observed symbol frequencies (rather than say the invariant distribution of the chain); that's the vertical axis. The horizontal axis shows how many symbol types were retained --- i.e., "100 tokens" means that only the 100 most common symbols in the corpus were kept, and the chain was fit to those sequences. (This is not explained in the paper but was made clear in later correspondence between Sproat and the authors.) There are two lines for English, depending on whether "token" was taken to mean "character" (differentiating upper and lower case) or to mean "word".

The bottom curve shows the estimated conditional entropy from a purely deterministic sequence; the actual conditional entropy is in fact zero, so I presume that the upward trend is an artifact of the smoothed transition probabilities. The top curve, on the other hand, is from a uniform IID sequence --- here the real conditional entropy is the same as the marginal entropy, but both grow as N increases because the size of the symbol set grows. (I.e., this is an artifact of keeping only the most common symbols.)

Here is the flaw: there is no demonstration that only linguistic sequences have this pattern in their conditional entropies. Rao et al. have shown that two really extreme non-linguistic processes don't, but that's not a proof or even a plausibility argument. I would settle for an argument that non-linguistic processes have to be really weird to show this pattern, but even that is lacking. In Mayo's terms, they have not shown that this test has any severity.

Of course the fact that they haven't shown their test is severe doesn't mean that it isn't, in fact, severe. So, by way of procrastinating, I spent some time yesterday constructing a counter-example. My starting point was what Mark had done, generating a sequence of IID draws from a geometric distribution (rather than a uniform one) and subjecting it to the same analysis as Rao et al. As it happens, I had already written a function in R to fit Markov chains and calculate their log-likelihood, and here the conditional entropy is the negative log likelihood over the sequence length. (Admittedly this is only true using the maximum likelihood estimates for transition probabilities, rather than smoothed estimates as Rao et al. do, but my simulations had so much data this shouldn't matter.) Setting the rate of the geometric distribution to 0.075, here were my first results.

Mark Liberman and Richard Sproat did almost the same thing pretty much simultaneously, as you can see from the updates to Mark's post.

This was not entirely satisfactory, since (as Rao et al. point out in the online supplementary materials), there is a big gap between the marginal and conditional entropies for writing and for the Indus symbols. This was also, however, not too hard to finesse. In addition to the geometric sequence, I generated a Markov chain which alternated between the values +1 and -1, but where each positive or negative sign was 99% likely to be followed by the same sign. (That is, the signs were highly persistent.) I then multiplied the IID geometric variables (plus 1) by the Markov signs. This gave a larger set of symbols, but where knowing the sign of the current symbol (which "register" or "sub-vocabulary" it came from) was quite informative about the sign of the next symbol. (I added 1 to the geometric variables to exclude 0=-0, keeping the sub-vocabularies distinct.)


Pluses: marginal entropy; circles: conditional entropy

A third experiment takes after the fact that the Indus symbol sequences are all extremely short, at most a dozen characters or so. In stead of having a Markov chain for the sign, I used another, independent set of random draws, uniform on the integers from 2 to 6, to divide the sequence into blocks, and gave all the symbols in each block the same (coin-toss) sign.


Pluses: marginal entropy; circles: conditional entropy

(Because I'm doing everything with a single long sequence, I artificially introduce transitions from positive to negative signs, which lowers the gap between the conditional and unconditional entropies. If I wanted to do this properly, I'd re-write my Markov estimator so it used many short sequences; but that would be too much like real work.)

The mechanism producing the gap between conditional and unconditional entropies is that the marginal distribution of symbols is a mixture of several pure distributions, and which mixture component we draw from now influences which component we'll draw from next (so the sequence can be Markov, exchangeable, etc.). Given the mixture components, the symbols are independent and the conditional and unconditional entropies are equal. Without that knowledge, the first symbol in effect is a cue for figuring out the mixture component, reducing the entropy of the second. There is nothing specifically linguistic about this; any hidden Markov model does as much. It would, for instance, work if the "symbols" were characters in randomly-selected comic books, who cluster (though slightly imperfectly); if that's too low-brow, think about Renaissance paintings, and the odds of seeing St. John the Baptist as opposed to a swan in one which contains Leda.

I have made no attempt to match the quantitative details Rao et al. report for the Indus symbols, just the qualitative patterns. Were I to set out seriously to do so, I'd get rid of the geometric distribution, and instead use a hidden Markov model with more than two states, each state having a distinct output alphabet, the distribution of which would be a Zipf (as used by Liberman or Sproat) or a Yule-Simon. (I might also try block-exchangeable sequences, as in my third example.) But this would approach real work, rather than a few hours of procrastination, and I think the point is made. Perhaps the specific results Rao et al. report can only be replicated by making distributional assumptions which are very implausible for anything except language, but I'd say that the burden of proof is on them. If, for instance, they analyzed lots of real-world non-linguistic symbol systems (like my comic books) and showed that all of them had very different conditional entropy curves than did actual writing, that would be a start.

I should in the interest of full disclosure say that a number of years ago Farmer and I corresponded about his work on the development of pre-modern cosmologies, which I find interesting and plausible (though very conjectural). But if anything I hope Farmer et al. are wrong and the Indus Valley civilization was literate, and I'd be extra pleased if the language were related to Tamil. Unfortunately, if that's the case it will need to be shown some other way, because these conditional entropy calculations have, so far as I can see, no relevance to the question at all.

My code (in R) is here if you want to play with this, or check if I'm doing something stupid. In the unlikely event you want more, I suggest reading the reply of Farmer et al., Rahul Siddharthan (especially the comments), or Fernando Pereira; the last is probably the wisest.

Manual trackback: Metadatta; Language Hat

Enigmas of Chance; Writing for Antiquity

Posted by crshalizi at April 28, 2009 22:30 | permanent link

April 10, 2009

Next Week at the Statistics Seminar: Bayes, Bayes, Baked Beans, Sausage and Bayes

Next week at the CMU statistics seminar, we give you all the Bayes you could want and more:

Tom Griffiths, "Connecting human and machine learning via probabilistic models of cognition"
Abstract: Human performance defines the standard that machine learning systems aspire to in many areas, such as forming new concepts, making scientific discoveries, and learning language. This suggests that studying human cognition may be a good way to develop better learning algorithms, as well as providing basic insights into how the mind works. However, in order for ideas to flow easily from psychology to computer science and vice versa, we need a common language for describing human and machine learning. I will summarize recent work exploring the hypothesis that probabilistic models of cognition, which view learning as a form of statistical inference, provide such a language, including results that illustrate how novel ideas from statistics can inform cognitive science.
Date and place: Monday, 13 April 2009, 4-5 pm in Porter Hall 125C
Peter Orbanz, "Conjugate Projective Limits"
Abstract: Bayesian nonparametric models are essentially Bayesian models on infinite-dimensional spaces. Most work along these lines in statistics focusses on probability models over the simplex. In machine learning, the problem has recently received much attention as well, and attempts have been made to define models on a wider range of infinite-dimensional objects, including measures, functions, and infinite permutations and graphs.
In my talk, I will discuss the construction of nonparametric Bayesian models from finite-dimensional Bayes equations, roughly analogous to the Kolmogorov extension of systems of measures to their projective limits. I will present an extension theorem applicable to regular conditional probabilities. This can be used to study whether "conditional" properties of the finite-dimensional marginal models, such as conjugacy and sufficiency, carry over to the infinite-dimensional projective limit model, and to determine the functional form of the nonparametric Bayesian posterior if the model is conjugate.
Date and Place: Friday, 17 April 2009, 1-2 pm, Porter Hall A18A
Both talks are free and open to the public.

Enigmas of Chance; Minds, Brains, and Neurons

Posted by crshalizi at April 10, 2009 13:08 | permanent link

In Another Green World

I will be completely offline from the 11th to 20th, while I contemplate whether certain referees deserve to be fed to jaguars, or whether it would not be more humane to sacrifice them to the great feathered serpent. (I mean humane to the cats, of course.)

Self-centered

Posted by crshalizi at April 10, 2009 13:00 | permanent link

April 03, 2009

Next Week at the Statistics Seminar: "Methods and Models for Time-Dependent Relational Data"

Next week at the CMU statistics seminar:

Andrew C. Thomas, "Methods and Models for Time-Dependent Relational Data"
Abstract: In recent years there has been a proliferation of relational data, including studies of social networks, where outcomes are measured over time. Methods that use binary outcomes traditionally rely on Markov chain assumptions that indicate the generation, maintenance and severing of network ties, often relying on dichotomized perceptions of relationships. I present a hierarchical latent variable model for time-dependent relations that makes use of dynamic programming techniques for its solution, and discuss the expansion of this model to non-binary ties, with several real and synthetic examples of social and professional interactions.
Date and place: Monday, 6 April 2009, 4-5 pm in Porter Hall 125C
Free and open to the public.

Networks; Enigmas of Chance

Posted by crshalizi at April 03, 2009 13:37 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems