Sun, 05 Feb 2012

Time Series, or Statistics for Stochastic Processes and Dynamical Systems

Rates of convergence of estimators; analogs to VC-dimension results (see Meir's paper below). Large deviation techniques. Prediction schemes. Are there universal schemes which do not demand exponentially growing volumes of data?

If you have an ergodic process, then the sample-path mean for any nice statistic you care to measure will, almost surely, converge to the distributional mean. This is even true of trajectory probabilities (i.e., if you want to know the probability of a certain finite-length trajectory, simply count how often it happens). So "sit and count" is a reliable and consistent statistical procedure. If the process mixes sufficiently quickly, the rate of convergence might even be respectable. But this doesn't say anything about the efficiency of such procedures, which is surely a consideration. And what do you do for non-ergodic processes? (Take multiple runs and hope they're telling you about different ergodic components?) Non-stationary, even?
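As a concrete illustration of "sit and count", here is a minimal sketch. It assumes a toy two-state Markov chain (the chain, the function names and the parameter `p_stay` are all made up for illustration) and estimates the probability of each length-2 trajectory by counting along one long sample path.

```python
import random
from collections import Counter

def simulate_chain(n_steps, p_stay=0.9, seed=0):
    """Simulate a two-state (0/1) ergodic Markov chain that keeps its
    current state with probability p_stay and flips otherwise."""
    rng = random.Random(seed)
    state = 0
    path = []
    for _ in range(n_steps):
        path.append(state)
        if rng.random() > p_stay:
            state = 1 - state
    return path

def count_trajectories(path, length):
    """Estimate the probability of every trajectory of the given length
    by counting how often it occurs along the sample path."""
    windows = len(path) - length + 1
    counts = Counter(tuple(path[i:i + length]) for i in range(windows))
    return {word: c / windows for word, c in counts.items()}

path = simulate_chain(200_000)
estimates = count_trajectories(path, length=2)

# By symmetry the invariant distribution is (1/2, 1/2), so the exact
# probability of the trajectory (0, 0) is 0.5 * 0.9 = 0.45.
print(estimates[(0, 0)])
```

The estimate settles near 0.45 because this chain mixes quickly; pushing `p_stay` toward 1 slows the mixing and, with it, the convergence.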

I need to learn more about frequency-domain approaches; despite being raised as a physicist, I find the time domain much more natural. After all, the frequency domain is effectively just one choice of a function basis, and there are infinitely many others, which might in some sense be more appropriate to the process at hand. But that's at least in part a rationalization against having to learn more math.

LSE econometrics and its "general-to-specific" modeling procedure is very interesting, and I think possibly even related to stuff I've done, but I need to understand it much better than I do.

(This notebook really needs subdivision.)

See also: Bootstrapping; Change-Point Problems; Control Theory; Dynamical Systems; Ergodic Theory; Filtering, State Estimation and Signal Processing; Grammatical Inference; Information Theory; Machine Learning, Statistical Inference and Induction; Markov Models and Hidden Markov Models; Neural Coding; Power Law Distributions, 1/f Noise and Long-Memory Processes; Recurrence Times of Stochastic Processes (also Hitting, Waiting, and First-Passage Times); Sequential Decisions Under Uncertainty; State-Space Reconstruction; Statistical Learning Theory with Dependent Data; Statistics; Stochastic Processes; Symbolic Dynamics; Universal Prediction Algorithms

#

Sat, 04 Feb 2012

Recommended Science Fiction
These range from merely good reads to really outstanding books. A raw ranking of them would be of little use to others, unless I explained why I gave them the ranks I did, and anyway I'd probably give different rankings by the time you read this. (When I know of an on-line review about a book which I agree with --- e.g., because I wrote it --- I've included a link; also some exceedingly short remarks about interesting cases.)

See also Fantasy and Horror Recommendations; Science Fiction

#

Recommended Fantasy Books
These range from merely good reads to really outstanding books; but rather than trying to rate each one, or (what would be more to the point) explain my ratings, I've merely listed them without any particular indication of rank. Horror novels are included here for want of anyplace better to put them. Titles are added as they occur to me.

Links on titles are generally to my review or briefer comments, if I have any.

See also Fantasy; Science Fiction Recommendations.

#

Fri, 03 Feb 2012

Dynamic Stochastic General Equilibrium Models in Macroeconomics (DSGEs)

Pretend that the national economy consists of a single person, the "representative agent". This agent owns all the goods, especially all the capital goods, and does all the work in the economy. The agent is greedy for material consumption, and lazy. To consume, which it likes, it must produce, which is a matter of indifference, except that to produce it must work, which it dislikes. If it produces more now than it consumes, it can save the difference as capital goods, which make its future labor more productive. There are also shocks to "technology", i.e., to how effectively it can use capital to turn labor into consumption goods; rather bizarrely, these shocks are both negative and positive, which means that it regularly forgets productive technologies, and not because better replacements have come along.

In addition to being greedy and lazy, the agent is determined to act now so as to maximize not present utility, but the discounted future stream of utility at all times (since it is also immortal). Fortunately, it is incredibly foresighted, and knows the exact distribution of future shocks to technology. (This distribution is not changed by anything the agent does; or, if you like, it always acts in such a way that its expectations are exactly fulfilled.) Possessing unlimited cognitive resources, it is easy for the agent to solve the resulting dynamic programming problem optimally. This will not lead to a smooth pattern of production, investment and consumption; if, for instance, there is a big negative shock to technology, and shocks are persistent, it becomes rational to slack off now, and enjoy leisure; extra work will be more rewarded later when the agent will have remembered how to do stuff. These fluctuations are, supposedly, the fluctuations of the macroeconomy, the business cycle.

I have sketched this sort of model in a deliberately hostile way, because I think such things are remarkably silly. But many very eminent economists regard them very highly indeed. Mostly I think this reflects badly on the discipline of macroeconomics, but it does raise some interesting technical problems, like:

(You might well ask "where is the equilibrium, let alone the general equilibrium, in a model with one agent and no trade?" You might very well ask that.)

#

Mon, 23 Jan 2012

Literary Criticism and Theory of Criticism

There are a great many books to read; there are many places to travel to. Travellers are often much better for advice --- where to go, where to avoid, what to know and what to do to get the most out of their trip. It is my humble opinion that works of literary criticism are the travel books of the written world --- sometimes guides (and it is, for instance, a rash traveller who visits Wallace Stevens without one), sometimes reportage, travelogues, impressions. This is, or can be, a worthwhile enterprise, but it does not sound like one which needs or would benefit from a vast and obscure body of theory, nor one whose successful practitioners are likely to be able theorists.

What, then, accounts for the current deluge of theory of criticism, as opposed to criticism proper (and as opposed to critical theory, a different beast altogether)? I have no idea, but I feel licensed by the subject matter to speculate as to the causes.

  1. Disaffection. Mencken observed seventy or eighty years ago: "Every now and then, a sense of the futility of their daily endeavors falling suddenly upon them, the critics of Christendom turn to a somewhat sour and depressing consideration of the nature and objects of their own craft." This however merely backs things up one stage: why should critics feel that criticism is not enough, and practice it? Failing to practise criticism, why don't they give up and become actual novelists, poets, etc.? (Frank Lentricchia has finally taken this honorable course.)
  2. Vicious cycle. Suppose that, for whatever reason, theory of criticism came to be prized more highly than criticism itself. Then it would be to the benefit of fledgling literary scholars to turn to theory, and to continue to place a high value upon it. (This last is important, since the study of literature, at least in the West, is close to self-governing.) Selection can take it from there, though that is not a guarantee that the result will be sustainable. Obvious query: why should theory be more valued than criticism? Second obvious query: what are the coefficients of selection?
  3. Professional deformation. During this century, and especially since the Second World War, criticism, and literary culture generally, have migrated into academia in the most striking way. The qualities needed by a good critic --- "intelligence, toleration, wide information, genuine hospitality to ideas," to keep with Mencken --- are hard to inculcate in a lecture or seminar, and make very poor dissertation material. But theory of criticism, however appalling (perhaps especially if appalling) can be lectured on and debated endlessly and published. (And cited. Criticism of, say, Milton, is unlikely to be cited by anyone but other Milton scholars; but theory of criticism can be cited by other theorists and by critics.) Because they no longer need appeal to any public other than themselves, the usual concentration of mutants and anomalies found in small, in-bred populations may be expected.
  4. Physics Envy. Modesty forbids me to elaborate on this.
  5. Spirit of the Age. It has sometimes been claimed that "we" are now much more self-conscious and reflexive than our predecessors. This would seem to fit with critics preferring to theorize about criticism to criticizing, but the exact relationship is obscure. Would the general increase in self-consciousness explain the shift to theory, or would the shift be part of what is meant by the general increase in self-consciousness?

But at this point a doubt arises. Has higher-order writing grown faster than direct, first-order literature or its immediate, second-order criticism? I know of no statistics on this, so I made some very crude ones of my own, by counting the number of titles in the UW-Madison on-line catalog assigned to various Library of Congress call numbers. Books in the category PN, which are about literature in general, grew at 4.1 +- 0.2 percent between January 1950 and April 1998; the PS, PR and PZ categories, which roughly comprise literature in English (with some translations in PZ, and criticism in PS and PR) at only 2.9 +- 0.1 percent. By way of comparison, the QC category, which is (almost all of) physics grew at 4.8 +- 0.4, and the combination of PG, PQ and PT (literature in modern European languages other than English) at 3.9 +- 0.2. (The numbers are from a least-squares fit to a simple exponential curve, so the error bars should be taken with grains of salt.) The growth of non-English literature is probably mostly a change in our acquisition policy, but the difference between English literature and writing about literature is clearly statistically significant. Going from the number of books to the number of writers and so to something like relative fitnesses for different sorts of literary writers would, however, be pretty difficult. (Thanks to Jason Hsu for pointing out an unfortunate ambiguity of wording.)
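For readers who want to try this kind of estimate, here is a minimal sketch of a least-squares exponential fit, done as a straight-line fit to the logarithm of the counts. That is one common way to do it and an assumption here; the exact procedure behind the figures above is not specified. The counts below are synthetic, not catalog data, and the sketch treats the growth rate as annual, which is likewise an assumption.

```python
import math

def fit_exponential_growth(years, counts):
    """Least-squares fit of log(count) = a + r * year.
    Returns (r, standard error of r), where r is the continuous growth rate."""
    n = len(years)
    logs = [math.log(c) for c in counts]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    r = sxy / sxx
    a = mean_y - r * mean_x
    rss = sum((y - a - r * x) ** 2 for x, y in zip(years, logs))
    # Textbook standard error; it assumes independent, equal-variance errors,
    # which cumulative title counts will not really satisfy.
    se = math.sqrt(rss / (n - 2) / sxx)
    return r, se

# Synthetic counts growing at exactly 4% a year (made up for illustration).
years = list(range(1950, 1999))
counts = [1000 * math.exp(0.04 * (y - 1950)) for y in years]
rate, stderr = fit_exponential_growth(years, counts)
print(f"{100 * rate:.1f} +- {100 * stderr:.1f} percent per year")
```

The independence assumption behind the standard error is one concrete reason to take error bars from such a fit with grains of salt.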

At some point I should use this space to record some thoughts about what a natural history of literature would look like, and how it would differ from hitherto-existing literary criticism; but really I should be working now, and you can probably figure out what I'd say from my contribution to the Valve's symposium on Moretti. At the same time, because I seem to have been unclear about this, I should emphasize that I don't think that sort of natural history is the only sort of literary scholarship, much less the only sort of literary criticism, worth pursuing.

See also Analogy and Metaphor; Books and Their History; Cognitive Science; Cultural Criticism; Epics and Oral Poetry; Fantasy; Sigmund Freud and Psychoanalysis; Intellectual Standards and Competence; Intellectuals; Linguistics; Modernity and All That; Mysteries; Myths; Narratives; Novels; Poetry, Poets; Postmodernism, Poststructuralism, etc.; the Romanticists; Rhetoric; Science Fiction; Semiotics; Structuralism; Universal signs, images and symbols

#

Graphical Models

[Update, 12 March 2010: On re-reading I am less than happy with this, because I have come to appreciate the uses of graphical models in non-causal modeling more, and this slights them unnecessarily. I will try to re-write this soon.]

A.k.a. causal models, causal graphs, Bayes graphs, Bayes networks, Bayesian networks. (Here "Bayes" is a metonym for "conditional probability". There are perfectly good frequentist interpretations of these models.) I'm sticking latent-variable and path-analysis models in here, too, because they all pretty much work the same way.

Everyone who takes basic statistics has it drilled into them that "correlation is not causation." (When I took psych. 1, the professor said he hoped that, if he were to come to us on our death-beds and prompt us with "Correlation is," we would all respond "not causation.") This is a problem, because one can infer correlation from data, and would like to be able to make inferences about causation. There are typically two ways out of this. One is to perform an experiment, preferably a randomized double-blind experiment, to eliminate accidental sources of correlation, common causes, etc. That's nice when you can do it, but impossible with supernovae, and not even easy with people. The other out is to look for correlations, say that of course they don't equal causations, and then act as if they did anyway. The technical names for this latter course of action are "linear regression" and "analysis of variance," and they form the core of applied quantitative social science, e.g., The Bell Curve.

Graphical models are, in part, a way of escaping from this impasse.

The basic idea is as follows. You have a bunch of variables, and you want to represent the causal relationships, or at least the probabilistic dependencies, between them. You do so by means of a graph. Each node in the graph stands for a variable. If variable A is a cause of B, then an arrow runs from A to B. If A is a cause of B, we also say that A is one of B's parents, and B one of A's children. If there is a causal path from A to B, then A is an ancestor of B, and B is a descendant of A. If a variable has no parents in the graph, it is exogenous, otherwise it is endogenous.

Part of what we mean by "cause" is that, when we know the immediate causes, the remoter causes are irrelevant --- given the parents, remoter ancestors don't matter. The standard example is that applying a flame to a piece of cotton will cause it to burn, whether the flame came from a match, spark, lighter or what-not. Probabilistically, this is a conditional independence property, or a Markov property: a variable is independent of its ancestors conditional on its parents. In fact, given its parents, its children, and its children's other parents, a variable is conditionally independent of all other variables. This is called the graphical or causal Markov property. When this holds, we can factor the joint probability distribution for all the variables into the product of the distribution of the exogenous variables, and the conditional distribution for each endogenous variable given its parents.
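Written out, the factorization looks like the following. The notation is an assumption introduced for illustration: pa(x_i) stands for the parents of variable i, and Exo and Endo are the index sets of exogenous and endogenous variables.

```latex
% Joint distribution under the causal Markov property:
p(x_1, \ldots, x_n) \;=\; p(x_{\mathrm{Exo}}) \prod_{i \in \mathrm{Endo}} p\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)

% Example: for the three-variable chain A -> B -> C,
p(a, b, c) \;=\; p(a)\, p(b \mid a)\, p(c \mid b)
```

In the chain example, C depends on A only through B, which is exactly the statement that remoter ancestors don't matter once the parents are given.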

(You may be wondering what happens if A is a parent of B and B is a parent of A, as can happen when there is feedback between the variables. This leads to difficulties, traditionally dealt with by explicitly limiting the discussion to acyclic graphs. I shall follow this wise precedent here.)

Now, there are certain rules which let us infer conditional independence relations from each other. For instance, if X is independent of the combination of Y and W, given Z, then X is independent of Y alone given Z. So, if we have a graph which obeys the causal Markov condition, there are generally other conditional independence relations which follow from the basic ones. If these are the only conditional independences which hold in the distribution, it is said to be faithful to the graph (or vice versa); otherwise it is unfaithful. For a graph to be Markov and unfaithful, there must (as it were) be an elaborate conspiracy among the conditional distributions, so elaborate that it will generally be destroyed by any change in any of those distributions. So faithfulness is a robust property.

This may sound pretty arcane, but that's just because it is arcane. The point, however, is that if you can make the three assumptions above (no causal cycles, Markov property, faithfulness), you're in business in a really remarkable way. There are very powerful statistical techniques that will let you infer the causal structure connecting your variables. This comes in two flavors. One is the Bayesian way: cook up a prior distribution over all possible causal graphs; compute the likelihood of the data under each graph; update your distribution over graphs; iterate. This is generally computationally intractable, assuming you can come up with a meaningful prior in the first place. The other approach is to use tests for conditional independence to eliminate possible connections between variables, and so to narrow down the range of candidate structures; it is basically frequentist, and can be shown, under a broad range of circumstances, to be asymptotically reliable.
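To make the second, test-based flavor a little more concrete, here is a minimal sketch of a single edge-elimination step. It is not a full structure-learning algorithm. It assumes linear-Gaussian variables, so that conditional independence can be checked with a partial correlation and a Fisher z-test. The data are simulated from a hypothetical chain X -> Y -> Z, and all names are illustrative.

```python
import math
import random

def corr(u, v):
    """Sample Pearson correlation."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def partial_corr(u, v, w):
    """Correlation of u and v after linearly controlling for w."""
    ruv, ruw, rvw = corr(u, v), corr(u, w), corr(v, w)
    return (ruv - ruw * rvw) / math.sqrt((1 - ruw ** 2) * (1 - rvw ** 2))

def looks_independent(r, n, n_controls, z_crit=2.58):
    """Fisher z-test of the null that a (partial) correlation is zero."""
    z = math.atanh(r) * math.sqrt(n - n_controls - 3)
    return abs(z) < z_crit

# Simulate the chain X -> Y -> Z with unit-variance Gaussian noise.
rng = random.Random(1)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]
z = [yi + rng.gauss(0, 1) for yi in y]

# X and Z are marginally dependent, so a direct X-Z edge survives this test...
keep_xz_marginal = not looks_independent(corr(x, z), n, 0)
# ...but conditioning on Y should (up to test error) remove the dependence,
# so the direct X-Z edge would normally be dropped.
keep_xz_given_y = not looks_independent(partial_corr(x, z, y), n, 1)
print(keep_xz_marginal, keep_xz_given_y)
```

A real procedure repeats this kind of test over many pairs and conditioning sets, and then has to orient the surviving edges; the sketch only shows the basic move.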

Once you have your causal graph --- whether through estimation or through simply being handed one --- you can do lots of great things with it, like predict the effects of manipulating some of the variables, or make backward inferences from effects to causes. Of course, if the graph is big, doing the necessary calculations can be very troublesome in itself, and so people work on approximation methods and even ways of doing statistical inference on models of statistical distributions...

It's probably obvious I think this is incredibly neat, and even one of the most important ideas to come out of machine learning. Of course it doesn't really solve the problem of establishing causal relations, in the way Hume objected to; it says, assuming there are causal relations, of a certain stochastic form, and that these are stable, then they can be learned. But that, and the more general questions of what we ought to mean by "cause", deserve a notebook of their own.

Things I want to understand better: frequentist inference procedures. Computational learning theory for graphical models (the paper by Janzing and Herrmann is good). How to treat systems with feedback? How to treat dynamical systems and time series? How does all of this fit together with computational mechanics?

Not even a conjecture. Back in the 1960s, Chow and Liu (reference below) gave a polynomial algorithm for finding the best approximation to a global joint probability distribution using only pairwise interactions among the variables, i.e., the one which minimized the Kullback-Leibler divergence between the true and the approximating distribution. I have read that extending this to even three-way interactions is NP, though I don't know if it's NP-complete. (1) How is the intractability result established? (2) Is this the same as the computational phase transition one finds in going from 2-SAT to 3-SAT, where the critical point is at two-point-something SAT? (Presumably the answer to (1) would shed some light on this.) (3) Even if not, is there an analogous phase transition, perhaps in a different universality class? (Update in 2009, several years later: Bento and Montanari, below, sounds relevant, but I haven't read it yet.)
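The Chow-Liu construction can be sketched compactly: estimate the mutual information between every pair of variables, then take a maximum-weight spanning tree over those pairs. What follows is a minimal illustration under assumptions of my choosing (discrete data stored as rows of integers, Kruskal's algorithm for the spanning tree, made-up function names), not a transcription of the original paper.

```python
import math
import random
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum((c / n) * math.log(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu_tree(data):
    """Edges of the maximum-weight spanning tree under pairwise mutual
    information, found with Kruskal's algorithm and a union-find."""
    n_vars = len(data[0])
    weighted = sorted(((mutual_information(data, i, j), i, j)
                       for i, j in combinations(range(n_vars), 2)),
                      reverse=True)
    parent = list(range(n_vars))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:            # adding this edge creates no cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges

# Synthetic binary chain X0 -> X1 -> X2 -> X3: each variable copies its
# predecessor, flipping with probability 0.2.
rng = random.Random(0)

def noisy_copy(bit, flip=0.2):
    return bit ^ (rng.random() < flip)

data = []
for _ in range(5000):
    row = [rng.randint(0, 1)]
    for _ in range(3):
        row.append(noisy_copy(row[-1]))
    data.append(row)

tree = chow_liu_tree(data)
print(sorted(tree))
```

On data like these the recovered edges should be the chain's own links, since adjacent variables share more information than distant ones.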

(Thanks to Gustavo Lacerda for pointing out a goof.)

#

Model Selection

(Reader, please make your own suitably awful pun about the different senses of "model selection" here, as a discouragement to those finding this page through prurient searching. Thank you.)

In statistics and machine learning, "model selection" is the problem of picking among different mathematical models which all purport to describe the same data set. This notebook will not (for now) give advice on it; as usual, it's more of a place to organize my thoughts and references...

Classification of approaches to model selection (probably not really exhaustive but I can't think of others, right now):

Direct optimization of some measure of goodness of fit or risk on training data.
Seems implicit in a lot of work which points to marginal improvements in "the proportion of variance explained", mis-classification rates, "perplexity", etc. Often, also, a recipe for over-fitting and chasing snarks. What's wanted is (almost always) some way of measuring the ability to generalize to new data, and in-sample performance is a biased estimate of this. Still, with enough data, if the gods of ergodicity are kind, in-sample performance is representative of generalization performance, so perhaps this will work asymptotically, though in many cases the researcher will never even glimpse Asymptopia across the Jordan.
Optimize fit with model-dependent penalty
Add on a term to each model which supposedly indicates its ability to over-fit. (Adjusted R^2, AIC, BIC, ..., all do this in terms of the number of parameters.) Sounds reasonable, but I wonder how many actually work better, in practice, than direct optimization. (See Domingos for some depressing evidence on this score.)
Classical two-part minimum description length methods were penalties; I don't yet understand one-part MDL.
Penalties which depend on the model class
Measure the capacity of a class of models to over-fit; penalize all models in that class accordingly, regardless of their individual properties. Outstanding example: Vapnik's "structural risk minimization" (provably consistent under some circumstances). Only sporadically coincides with *IC-type penalties based on the number of parameters.
Cross-validation
Estimate the ability to generalize to different data by, in fact, using different data. Maybe the "industry standard" of machine learning. Query, how are we to know how much different data to use?
Query, how are we to cross-validate when we have complex, relational data? That is, I understand how to do it for independent samples, and I even understand how to do it for time series, but I do not understand how to do it for networks, and I don't think I am alone in this. (Well, I understand how to do it for Erdos-Renyi networks, because that's back to independent samples...)
The method of sieves
Directly optimize the fit, but within a constrained class of models; relax the constraint as the amount of data grows. If the constraint is relaxed slowly enough, should converge on the truth. (Ordinary parametric inference, within a single model class, is a limiting case where the constraint is relaxed infinitely slowly, and we converge on the pseudo-truth within that class [provided we have a consistent estimator].)
Encompassing models
The sampling distribution of any estimator of any model class is a function of the true distribution. If the true model class has been well-estimated, it should be able to predict what other, wrong model classes will estimate, but not vice versa. In this sense the true model class "encompasses the predictions" of the wrong ones. ("Truth is the criterion both of itself and of error.")
General or covering models
Come up with a single model class which includes all the interesting model classes as special cases; do ordinary estimation within it. Getting a consistent estimator of the additional parameters this introduces is often non-trivial, and interpretability can be a problem.
Model averaging
Don't try to pick the best or correct model; use them all with different weights. Choose the weighting scheme so that if one is best, it will tend to be more and more influential. Often I think the improvement is not so much from using multiple models as from smoothing, since estimates of the single best model are going to be more noisy than estimates of a bunch of models which are all pretty good. (This leads to ensemble methods.)
Adequacy testing
The correct model should be able to encode the data as uniform IID noise. Test whether "residuals", in the appropriate sense, are IID uniform. Reject models which can't hack it. Possibly none of the models on offer is adequate; this, too, is informative. Or: models make specific probabilistic assumptions (IID Gaussian noise, for example); test those. Mis-specification testing.
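Two of the approaches in the list above, parameter-count penalties and cross-validation, are easy to put side by side on a toy problem. The sketch below is only an illustration under stated assumptions: independent samples, Gaussian noise, and just two candidate models (a constant and a straight line). The data are synthetic and every function name is made up.

```python
import math
import random

def fit_constant(xs, ys):
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def information_criteria(fit, n_params, xs, ys):
    """AIC and BIC for Gaussian-noise regression; n_params counts the
    regression coefficients plus the noise variance."""
    n = len(xs)
    sigma2 = mse(fit(xs, ys), xs, ys)      # maximum-likelihood noise variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 2 * n_params - 2 * log_lik, n_params * math.log(n) - 2 * log_lik

def cross_validated_mse(fit, xs, ys, k=5):
    """k-fold cross-validation: fit on k-1 folds, score on the held-out fold."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i not in held_out]
        model = fit([x for x, _ in train], [y for _, y in train])
        total += sum((ys[i] - model(xs[i])) ** 2 for i in held_out)
    return total / n

# Synthetic data with a genuine linear trend.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

candidates = {"constant": (fit_constant, 2), "line": (fit_line, 3)}
scores = {}
for name, (fit, n_params) in candidates.items():
    aic, bic = information_criteria(fit, n_params, xs, ys)
    scores[name] = (aic, bic, cross_validated_mse(fit, xs, ys))
    print(name, scores[name])
```

Here all three scores favor the line, since the trend is large relative to the noise; the interesting cases are the ones where the criteria disagree. Note that the interleaved folds rely on the samples being independent, which is exactly what fails for time series and networks.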

The machine-learning-ish literature on model selection doesn't seem to ever talk about setting up experiments to select among models; or do I just not read the right papers there? (The statistical literature on experimental design tends to talk about "model discrimination" rather than "model selection".)

#

Neural Modeling and Data Analysis

Especially, but not exclusively, modeling of spike trains (which is important for neural coding, and overlaps therewith).

Things to investigate: How easy would it be to adapt spike-sorting algorithms to cluster or classify other kinds of time series? Easy or not, would there be any point?

What's up with all the papers on using Ising models (and their variants) to model neural interactions? Some very respectable people are involved, but just saying the words makes me dubious. What's been done on using graphical-model structure learning for neural data?

See also: Neural Coding; Synchronization in Neural Systems; Neuroscience in general

#

Forecasting Non-Stationary Processes

Some non-stationary processes are in fact easy to forecast: periodic ones, for example, are strictly speaking not stationary. An ergodic Markov chain started far from its invariant distribution is also non-stationary, but easy to predict (it will approach the stationary distribution). Both of these cases are conditionally stationary, which I think is all that's really needed.

What's more interesting is the problem of so to speak really non-stationary processes. It's hard to imagine that there is any way to truly predict an arbitrary non-stationary process. (Basically: as soon as you think you have established a trend-line, the Adversary can always reverse the trend, without creating any problems of consistency with earlier data.) If you can constrain the class of allowable non-stationary processes, however, then something might be possible. Alternately, one might lower expectations, not to actually predicting well, but to predicting with low regret.
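"Predicting with low regret" can be illustrated with the exponentially weighted average forecaster from the prediction-with-expert-advice literature. This is a standard textbook scheme, sketched here under assumptions (squared loss, forecasts and outcomes in [0, 1], learning rate `eta` = 0.5); it is not specific to this notebook. The two "experts" and the trend-reversing sequence are made up.

```python
import math

def exponential_weights(expert_predictions, outcomes, eta=0.5):
    """Exponentially weighted average forecaster for squared loss.
    expert_predictions[t][i] is expert i's forecast at time t.
    Returns the combined forecasts and the final normalized weights."""
    n_experts = len(expert_predictions[0])
    log_w = [0.0] * n_experts

    def normalized(log_weights):
        m = max(log_weights)               # subtract max for numerical stability
        w = [math.exp(lw - m) for lw in log_weights]
        total = sum(w)
        return [wi / total for wi in w]

    forecasts = []
    for preds, y in zip(expert_predictions, outcomes):
        w = normalized(log_w)
        forecasts.append(sum(wi * p for wi, p in zip(w, preds)))
        # Down-weight each expert according to its loss on this round.
        log_w = [lw - eta * (p - y) ** 2 for lw, p in zip(log_w, preds)]
    return forecasts, normalized(log_w)

# The Adversary reverses the trend: 60 zeros, then 40 ones.
outcomes = [0.0] * 60 + [1.0] * 40
experts = [[0.0, 1.0] for _ in outcomes]   # "always 0" and "always 1"

forecasts, weights = exponential_weights(experts, outcomes)
loss = sum((f - y) ** 2 for f, y in zip(forecasts, outcomes))
best = min(sum((p[i] - y) ** 2 for p, y in zip(experts, outcomes))
           for i in range(2))
regret = loss - best
print(regret, math.log(2) / 0.5)
```

For squared loss on [0, 1] with this learning rate, the known guarantee is that regret against the best single expert is at most ln(N)/eta, whatever the sequence. The forecaster here still predicts the post-reversal stretch badly; it is merely never much worse than the best fixed expert in hindsight, which is the lowered expectation at issue.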

I actually have an Idea about using model averaging here, but need to find the time to work on it.

See also: Ensemble Methods in Machine Learning; Time Series; Universal Prediction

#

Relational Learning

That is, learning models of mathematical relations and relational structures from data, not learning in a relational manner.

See also: Graphical Models; Network Data Analysis; Statistics of Structured Data; Data Mining; Machine Learning, Statistical Inference, and Induction; Mathematical Logic

#

Analysis of Network Data

That is, of data in the form of networks --- I don't (as such) care about packet flow or other aspects of computer networks...

Things I wish I knew how to do: bootstrap a network, non-parametrically. (The model with a fixed degree sequence is a start, but what's the equivalent of the block bootstraps used for time series, which preserve dependence?) Cross-validation on networks. (You could say that link prediction is leave-one-out CV, but how about k-fold CV?) Estimate a distribution over networks by somehow smoothing an adjacency matrix. — These may or may not be three aspects of a single problem.
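For reference, the fixed-degree-sequence starting point can be sketched with double edge swaps, which randomize an undirected graph while leaving every node's degree unchanged. This is only that baseline, under assumptions of my choosing (simple undirected graph, made-up function names and example graph); it does not answer the question of what a dependence-preserving network bootstrap should look like.

```python
import random
from collections import Counter

def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph by double edge swaps:
    (a-b, c-d) -> (a-d, c-b). Each swap leaves all degrees unchanged.
    Swaps that would create a self-loop or a duplicate edge are skipped."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    present = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:             # pick one of the two possible pairings
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if a == d or c == b or new1 in present or new2 in present:
            continue
        present -= {edge_list[i], edge_list[j]}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return edge_list

# A made-up example graph: an 8-node ring plus three chords.
original = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (1, 5), (2, 6)]
rewired = degree_preserving_rewire(original, n_swaps=200)
print(sorted(rewired))
```

Everything beyond the degree sequence (clustering, community structure, and so on) is scrambled by the swaps, which is why this is a start rather than an analogue of the block bootstrap.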

Community discovery is an important sub-topic, and I like exponential family random graph models enough to give them their own notebook.

Although many of the relevant papers appear in the journal Social Networks, published by Elsevier, the company responsible for deliberately publishing pseudo-journals such as The Australasian Journal of Bone and Joint Medicine, I know of no particular reason to believe that their findings are problematic. It would, however, be good if the community could shift to a journal whose publishers do not subvert the peer-review process whenever they find it profitable to do so.

See also: Complex networks; Community discovery; Exponential families of random graph models; Homophily vs. influence; Relational learning; Social networks; Statistics in general; Statistics of structured data

#

Sun, 22 Jan 2012

Finance, Banking, "the Markets"

Probably for as long as there has been money, there have been people who had more of it than they wanted to spend right away, and many more people who wanted to spend more money than they had. If money could somehow pass from the first group to the second, people would be better off; the function of financial markets is to ease this passage. Their point is to keep excess funds from sitting idle, by allocating them among the different people and projects asking for money. Ideally, just as ordinary markets allocate goods and services to those for whom they are most "valuable" (i.e., have the highest combination of desire and ability to pay), financial markets should allocate money — which is, after all, a claim on the resources of the community — to its most valuable, most productive uses.

Such markets are necessarily strange: those who have the money, the savers, by definition do not want any tangible good or service those on the other side, the borrowers, can currently sell. (Otherwise, it would be an ordinary commercial transaction and not finance.) The trick is that borrowers sell savers promises of more money in the future, in return for which they get money now. Etymologically, at least, to extend credit is to believe (Latin credere) this promise. All this has been going on since Gilgamesh was king in Uruk.

To give some concrete examples: A corporate bond is a promise by the corporation to make regular interest payments for a number of years, ending with a lump-sum principal payment of the bond's face value. A common stock is a promise to get a fixed share of a firm's profits, along with a vote in how it is run. The once-standard home mortgage was a promise to make regular payments over, say, 30 years at a fixed interest rate, with the house, and a down payment on it, as hostages for the fulfillment of this promise.

Because financial instruments are promises, there is an intimate connection between them and predictions. All else being equal, how much you should pay for a bond depends on how much you prefer money now to money later, but also on how likely it is that the company will fulfill its promises. Turned around, a company which is widely believed to be able to keep its promises can offer to pay back less than one which is widely predicted to have trouble coming. Similarly for stock: if you just buy and hold on to a stock, the price to pay for a share depends on your prediction of the firm's future profits.
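The two ingredients in the bond example, time preference and the chance that the promise is broken, can be put into a small worked calculation. The sketch below is a deliberately crude textbook-style valuation under hypothetical assumptions: annual coupons, a constant discount rate, an independent chance of default each year, and nothing recovered after default. All numbers are made up.

```python
def bond_price(face, coupon, years, discount_rate, annual_default_prob=0.0):
    """Expected present value of a bond paying `coupon` each year and
    `face` at maturity. Assumes an independent chance of default each
    year and zero recovery once the issuer has defaulted."""
    price = 0.0
    survival = 1.0                  # probability no default has happened yet
    for t in range(1, years + 1):
        survival *= 1.0 - annual_default_prob
        payment = coupon + (face if t == years else 0.0)
        price += survival * payment / (1.0 + discount_rate) ** t
    return price

# A trusted issuer: the coupon rate equals the discount rate, so the bond
# is worth its face value.
safe = bond_price(face=100, coupon=5, years=10, discount_rate=0.05)

# The same promise from an issuer with a 2% chance of default each year.
risky = bond_price(face=100, coupon=5, years=10, discount_rate=0.05,
                   annual_default_prob=0.02)
print(safe, risky)
```

Turned around, as in the text: for the risky issuer to raise the full face value today, it would have to promise larger payments than the trusted one.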

Since, as the saying goes, "prediction is difficult, especially of the future", this makes pricing financial instruments hard enough, but there are further complications. There are often times when lenders wish they had money now, rather than just a promise. Demanding immediate repayment from their debtors, while it certainly happens, often yields disappointingly little, and can disrupt or even crush a useful enterprise that could have kept making ordinary payments. The second big trick of financial markets is to make promises of payment re-sellable, so that the payments go to whoever currently owns the promise, not the original lender; the promise becomes a marketable "security". When we speak of "the financial markets", we usually mean the secondary markets in promises. When one of these secondary markets exists, the value of a security depends not just on the direct, promised payments, but also on the resale price of the security --- most notably, the value of a share of stock depends not just on what the firm's profits will be, but also on what other people will be willing to pay for a share of them. The latter price will, of course, depend on the same things, but even further into the future, and so on.

At this point, it might seem that the original objective of figuring out good uses for the community's capital has fallen out of view; this superficial impression is, of course, correct. A large and able school of economists has created a great deal of confusion on this score by pushing something they call "the efficient markets hypothesis", which holds that it is basically impossible to anticipate the evolution of financial market prices. This is not quite true --- it is merely very difficult and hazardous --- but in any case it is very much a separate issue from whether securities prices actually are reliable signals about the relative value of different uses for capital. Nonetheless, in this age of the world we have, collectively, come to decide that financial markets beat any conceivable alternative at this, and accordingly loaded them with more and more power and responsibility; with what results, you can see around you.

— This does not explain how a socialist with no formal training in economics came to write for Quantitative Finance and teach in a computational finance program, but another time.

See also: Corporations and Corporate Finance; Economics; Globalization; Time Series

#

Ensemble Methods in Machine Learning

Boosting, bagging, binning, stacking, mixtures of experts, ...

I have an Idea about how to use model averaging to cope with non-stationary time series forecasting, but need to find time to work on it.

Value of diversity.

See also: Collective Cognition; Learning Theory; Model Selection

#

Thu, 19 Jan 2012

Collective Cognition

Rather than repeating myself about what I mean by "collective cognition", I refer you to my review of Ed Hutchins's Cognition in the Wild, and the introduction to the 2002 SFI Workshop on Collective Cognition I co-organized (that introduction is primarily based on an essay I wrote as a distraction from finishing my dissertation). I stole the phrase from Philip Agre, who told me he doesn't remember whence he got it. (This is fitting.)

The workshop was my first experience of helping to organize a scientific meeting, and quite enlightening. The focus shifted quite a bit from what I originally had in mind, but I still think the papers presented were good; many of them are available via the link for the workshop above.

Prediction markets, which I think are horribly over-rated, probably deserve a notebook of their own.

See also: Computational Models of Linguistic Evolution; Duality between Knowledge Centralization and Market Completeness; Emergent Properties; Ensemble Methods in Machine Learning; Evolving Local Rules to Perform Global Computations; Flocking and Swarms; Institutions; Sociology of Science

#

Wed, 18 Jan 2012

Decision Theory

By which I mean the various mathematical theories of optimal decision-making; a division of both statistics and economics. This is a fairly distinct topic from actual human decision-making, since people do not seem to conform very well to any of the theoretical ideals. This sometimes leads to much wailing and gnashing of teeth over our irrationality; if anything, however, it leads me to doubt that these theories are good formalizations of rationality. Nonetheless, they're mathematically interesting, and they do have certain very nice properties in the situations where you can actually get them to work.

See also: Sequential Decision Making Under Stochastic Uncertainty

#

Sun, 08 Jan 2012

Power Law Distributions, 1/f Noise, Long-Memory Time Series

Why do physicists care about power laws so much?

I'm probably not the best person to speak on behalf of our tribal obsessions (there was a long debate among the faculty at my thesis defense as to whether "this stuff is really physics"), but I'll do my best. There are two parts to this: power-law decay of correlations, and power-law size distributions. The link is tenuous, at best, but they tend to get run together in our heads, so I'll treat them both here.

The reason we care about power law correlations is that we're conditioned to think they're a sign of something interesting and complicated happening. The first step is to convince ourselves that in boring situations, we don't see power laws. This is fairly easy: there are pretty good and rather generic arguments which say that systems in thermodynamic equilibrium, i.e. boring ones, should have correlations which decay exponentially over space and time; the reciprocals of the decay rates are the correlation length and the correlation time, and say how big a typical fluctuation should be. This is roughly first-semester graduate statistical mechanics. (You can find those arguments in, say, volume one of Landau and Lifshitz's Statistical Physics.)

Second semester graduate stat. mech. is where those arguments break down --- either for systems which are far from equilibrium (e.g., turbulent flows), or in equilibrium but very close to a critical point (e.g., the transition from a solid to liquid phase, or from a non-magnetic phase to a magnetized one). Phase transitions have fluctuations which decay like power laws, and many non-equilibrium systems do too. (Again, for phase transitions, Landau and Lifshitz has a good discussion.) If you're a statistical physicist, phase transitions and non-equilibrium processes define the terms "complex" and "interesting" --- especially phase transitions, since we've spent the last forty years or so developing a very successful theory of critical phenomena. Accordingly, whenever we see power law correlations, we assume there must be something complex and interesting going on to produce them. (If this sounds like the fallacy of affirming the consequent, that's because it is.) By a kind of transitivity, this makes power laws interesting in themselves.

Since, as physicists, we're generally more comfortable working in the frequency domain than the time domain, we often transform the autocorrelation function into the Fourier spectrum. A power-law decay for the correlations as a function of time translates into a power-law decay of the spectrum as a function of frequency, so this is also called "1/f noise".
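
To make the contrast concrete, here is a sketch of the two cases in symbols. The notation (autocovariance C, spectrum S, variance \sigma^2, correlation time \tau_c, decay exponent \alpha) is my own choice for illustration, and the power-law statement holds only asymptotically and under the usual regularity conditions.

```latex
% Exponentially decaying correlations (the equilibrium case):
% the spectrum is a Lorentzian, flat at low frequencies.
C(\tau) = \sigma^2 e^{-|\tau|/\tau_c}
\quad\Longrightarrow\quad
S(f) = \int_{-\infty}^{\infty} C(\tau)\, e^{-2\pi i f \tau}\, d\tau
     = \frac{2\sigma^2 \tau_c}{1 + (2\pi f \tau_c)^2}

% Power-law decaying correlations: the spectrum diverges as a
% power of frequency near f = 0, approaching 1/f as \alpha \to 0.
C(\tau) \sim |\tau|^{-\alpha}, \quad 0 < \alpha < 1
\quad\Longrightarrow\quad
S(f) \sim |f|^{-(1-\alpha)} \quad \text{as } f \to 0
```

So the exponential case has a characteristic frequency, 1/(2\pi\tau_c), below which the spectrum levels off, while the power-law case has none.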

Similarly for power-law distributions. A simple use of the Einstein fluctuation formula says that thermodynamic variables will have Gaussian distributions with the equilibrium value as their mean. (The usual version of this argument is not very precise.) We're also used to seeing exponential distributions, as the probabilities of microscopic states. Other distributions weird us out. Power-law distributions weird us out even more, because they seem to say there's no typical scale or size for the variable, whereas the exponential and the Gaussian cases both have natural scale parameters. There is a connection here with fractals, which also lack typical scales, but I don't feel up to going into that, and certainly a lot of the power laws physicists get excited about have no obvious connection to any kind of (approximate) fractal geometry. And there are lots of power law distributions in all kinds of data, especially social data --- that's why they're also called Pareto distributions, after the sociologist.

Physicists have devoted quite a bit of time over the last two decades to seizing on what look like power-laws in various non-physical sets of data, and trying to explain them in terms we're familiar with, especially phase transitions. (Thus "self-organized criticality".) So badly are we infatuated that there is now a huge, rapidly growing literature devoted to "Tsallis statistics" or "non-extensive thermodynamics", which is a recipe for modifying normal statistical mechanics so that it produces power law distributions; and this, so far as I can see, is its only good feature. (I will not attempt, here, to support that sweeping negative verdict on the work of many people who have more credentials and experience than I do.) This has not been one of our more successful undertakings, though the basic motivation --- "let's see what we can do!" --- is one I'm certainly in sympathy with.

There have been two problems with the efforts to explain all power laws using the things statistical physicists know. One is that (to mangle Kipling) there turn out to be nine and sixty ways of constructing power laws, and every single one of them is right, in that it does indeed produce a power law. Power laws turn out to result from a kind of central limit theorem for multiplicative growth processes, an observation which apparently dates back to Herbert Simon, and which has been rediscovered by a number of physicists (for instance, Sornette). Reed and Hughes have established an even more deflating explanation (see below). Now, just because these simple mechanisms exist, doesn't mean they explain any particular case, but it does mean that you can't legitimately argue "My favorite mechanism produces a power law; there is a power law here; it is very unlikely there would be a power law if my mechanism were not at work; therefore, it is reasonable to believe my mechanism is at work here." (Deborah Mayo would say that finding a power law does not constitute a severe test of your hypothesis.) You need to do "differential diagnosis", by identifying other, non-power-law consequences of your mechanism, which other possible explanations don't share. This, we hardly ever do.
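
To show how cheap a power law can be, here is a minimal sketch of one such mechanism: exponential growth observed at an exponentially distributed random time, which is how I would summarize the Reed and Hughes idea. The function names, rates and sample size are illustrative assumptions, not taken from their paper. If X = exp(growth_rate * T) with T exponential at rate kill_rate, then P(X > x) = x^(-kill_rate/growth_rate) exactly, with no criticality anywhere in sight.

```python
import math
import random


def killed_growth_sample(n, growth_rate, kill_rate, seed=0):
    """Draw n sizes X = exp(growth_rate * T), with T ~ Exponential(kill_rate)."""
    rng = random.Random(seed)
    return [math.exp(growth_rate * rng.expovariate(kill_rate)) for _ in range(n)]


def empirical_survival(xs, x):
    """Fraction of the sample strictly larger than x."""
    return sum(1 for v in xs if v > x) / len(xs)


# Hypothetical parameters: tail exponent = kill_rate / growth_rate = 2.
samples = killed_growth_sample(100_000, growth_rate=1.0, kill_rate=2.0)

for x in (2.0, 4.0, 8.0):
    # Compare the empirical tail with the exact power law x**-2.
    print(x, empirical_survival(samples, x), x ** -2.0)
```

The empirical and exact tail probabilities should agree to within sampling noise at every x, which is the whole point: a pure power law from two exponentials.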

Similarly for 1/f noise. Many different kinds of stochastic process, with no connection to critical phenomena, have power-law correlations. Econometricians and time-series analysts have studied them for quite a while, under the general heading of "long-memory" processes. You can get them from things as simple as a superposition of Gaussian autoregressive processes. (We have begun to awaken to this fact, under the heading of "fractional Brownian motion".)
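
A minimal sketch of the superposition claim, under assumptions I have picked purely for tractability: independent AR(1) processes with unit innovation variance, whose coefficients phi are spread over (0, 1) with density 2(1 - phi). Each component has autocovariance phi^k / (1 - phi^2) at lag k, which decays exponentially; averaging over phi gives 2 * integral of phi^k / (1 + phi), which decays like 1/k. (Nothing here needs Gaussianity, since only second moments are involved.)

```python
def ar1_autocov(phi, lag):
    """Autocovariance at the given lag of an AR(1) with unit innovation variance."""
    return phi ** lag / (1.0 - phi ** 2)


def mixture_autocov(lag, n_grid=100_000):
    """Average ar1_autocov over phi with density 2*(1 - phi) on (0, 1).

    The (1 - phi) factors cancel, leaving the integral of
    2 * phi**lag / (1 + phi), evaluated here by the midpoint rule.
    """
    h = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        phi = (i + 0.5) * h
        total += 2.0 * phi ** lag / (1.0 + phi)
    return total * h


for lag in (10, 100, 1000):
    # lag * autocovariance tends to a constant for the mixture (power law),
    # but collapses to zero for any single AR(1) component.
    print(lag, lag * mixture_autocov(lag), lag * ar1_autocov(0.9, lag))
```

The middle column settles near a constant as the lag grows, while the last column falls off exponentially.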

The other problem with our efforts has been that a lot of the power-laws we've been trying to explain are not, in fact, power-laws. I should perhaps explain that statistical physicists are called that, not because we know a lot of statistics, but because we study the large-scale, aggregated effects of the interactions of large numbers of particles, including, specifically, the effects which show up as fluctuations and noise. In doing this we learn, basically, nothing about drawing inferences from empirical data, beyond what we may remember about curve fitting and propagation of errors from our undergraduate lab courses. Some of us, naturally, do know a lot of statistics, and even teach it --- I might mention Josef Honerkamp's superb Stochastic Dynamical Systems. (Of course, that book is out of print and hardly ever cited...)

If I had, oh, let's say fifty dollars for every time I've seen a slide (or a preprint) where one of us physicists makes a log-log plot of their data, and then reports as the exponent of a new power law the slope they got from doing a least-squares linear fit, I'd at least not grumble. If my colleagues had gone to statistics textbooks and looked up how to estimate the parameters of a Pareto distribution, I'd be a happier man. If any of them had actually tested the hypothesis that they had a power law against alternatives like stretched exponentials, or especially log-normals, I'd think the millennium was at hand. (If you want to know how to do these things, please read this paper, whose merits are entirely due to my co-authors.) The situation for 1/f noise is not so dire, but there have been and still are plenty of abuses, starting with the fact that simply taking the fast Fourier transform of the autocovariance function does not give you a reliable estimate of the power spectrum, particularly in the tails. (On that point, see, for instance, Honerkamp.)
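
For the record, the textbook estimator is short. This is a minimal sketch of the standard maximum-likelihood estimate for a continuous Pareto density p(x) proportional to x^(-alpha) above a known lower cutoff; the function name, the cutoff being known in advance, and the synthetic data are my assumptions for illustration. It covers only the estimation step, not the testing against alternatives.

```python
import math
import random


def pareto_mle(xs, x_min):
    """MLE of the density exponent alpha for p(x) ~ x**(-alpha), x >= x_min.

    Returns (alpha_hat, standard_error), using
    alpha_hat = 1 + n / sum(log(x_i / x_min)) and the asymptotic
    standard error (alpha_hat - 1) / sqrt(n).
    """
    tail = [x for x in xs if x >= x_min]
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha_hat = 1.0 + n / log_sum
    return alpha_hat, (alpha_hat - 1.0) / math.sqrt(n)


# Synthetic data with true alpha = 2.5 and x_min = 1, by inverse-transform
# sampling: P(X > x) = x**-(alpha - 1), so X = U**(-1 / (alpha - 1)).
rng = random.Random(42)
data = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(50_000)]

alpha_hat, std_err = pareto_mle(data, x_min=1.0)
print(alpha_hat, std_err)
```

Note that no binning, no log-log plot and no regression line are involved, and the estimate comes with a standard error.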

See also: Chaos and Dynamical Systems; Complex Networks; Self-Organized Criticality; Time Series; Tsallis Statistics

#

Branching Processes

A class of stochastic process important as models in genetics and population biology, chemical kinetics, and filtering. The basic idea is that there are a number of objects, often called particles, which, in some random fashion, reproduce ("branch") and die out; they can be of multiple types and occupy differing spatial locations. They can pursue their trajectories and their biographies either independently, or with some kind of statistical dependence across particles.

The most basic version has one type of particle, and no spatial considerations. At each time step, each particle gives rise to a random number of offspring; the distribution of offspring is fixed, and the number is independent across time-steps and across lineages (IID). This is the so-called Galton-Watson branching process. Galton introduced it as a model of the survival of (patrilineal) family names, so that only male offspring counted; he required the distribution of time until a given lineage went extinct. This was provided almost immediately by Watson, in a very elegant use of the method of generating functions, which is, itself, reproduced in probability textbooks down to the present day. (However, when I first encountered the problem, in a probability class, the teacher presented it as one about the survival of matrilineal lineages, defined by inheritance of mitochondrial DNA. Whether this was conscious subversion of the patriarchy, or just a reflection of the changing scientific interests between the 1890s and the 1990s, I couldn't say.)
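
A minimal sketch of the generating-function argument, with a made-up offspring distribution (0, 1 or 2 offspring with probabilities 1/4, 1/4, 1/2) and function names of my own. If G is the probability generating function of the offspring number, then the probability that a lineage started by one particle is extinct by generation n is G iterated n times at 0, and the limit is the smallest root of G(s) = s in [0, 1]; here that root is 1/2. A crude simulation is included as a check, treating any lineage that exceeds a population cap as having survived.

```python
import random


def pgf(s, offspring_probs):
    """Probability generating function G(s) = sum_k p_k * s**k."""
    return sum(p * s ** k for k, p in enumerate(offspring_probs))


def extinction_by_generation(offspring_probs, n_generations):
    """P(lineage extinct by generation n) for n = 1..n_generations."""
    q, out = 0.0, []
    for _ in range(n_generations):
        q = pgf(q, offspring_probs)  # q_n = G(q_{n-1}), starting from q_0 = 0
        out.append(q)
    return out


def simulate_extinction_time(offspring_probs, rng, max_generations=50, cap=200):
    """Generation at which one lineage dies out, or None if it (effectively) survives."""
    population = 1
    counts = range(len(offspring_probs))
    for generation in range(1, max_generations + 1):
        population = sum(rng.choices(counts, weights=offspring_probs, k=population))
        if population == 0:
            return generation
        if population > cap:  # extinction from here is negligibly unlikely
            return None
    return None


probs = [0.25, 0.25, 0.5]  # hypothetical offspring distribution, mean 1.25
cumulative = extinction_by_generation(probs, 200)

rng = random.Random(1)
times = [simulate_extinction_time(probs, rng) for _ in range(2000)]
extinct_fraction = sum(t is not None for t in times) / len(times)

print(cumulative[:5], cumulative[-1], extinct_fraction)
```

The successive differences of `cumulative` give the distribution of the extinction time itself, which is what Galton asked for.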

See also: Epidemiology; Social Contagion

To read:

  • David Assaf, Larry Goldstein and Ester Samuel-Cahn, "An unexpected connection between branching processes and optimal stopping", math.PR/0510587 = Journal of Applied Probability 37 (2000): 613--6 [This sounds like a nice pedagogical topic for a course in stochastic processes. I teach a course in stochastic processes....]
  • Michael Assaf and Baruch Meerson, "Spectral Theory of Metastability and Extinction in Birth-Death Systems", Physical Review Letters 97 (2006): 200602 = cond-mat/0610415
  • Krishna B. Athreya, Branching Processes
  • K. B. Athreya, A.P. Ghosh, S. Sethuraman, "Growth of preferential attachment random graphs via continuous-time branching processes", math.PR/0701649
  • Ellen Baake, Hans-Otto Georgii, "Mutation, selection, and ancestry in branching models: a variational approach", q-bio.PE/0611018
  • Romulus Breban, Raffaele Vardavas and Sally Blower, "Linking population-level models with growing networks: A class of epidemic models", Physical Review E 72 (2005): 046110
  • Nicolas Champagnat, Régis Ferrière, Sylvie Méléard, "Individual-based probabilistic models of adaptive evolution and various scaling approximations", math.PR/0510453
  • Charles R. Doering, Khachik V. Sargsyan and Leonard M. Sander, "Extinction times for birth-death processes: exact results, continuum asymptotics, and the failure of the Fokker-Planck approximation", q-bio/0401016
  • Pierre Del Moral, Feynman-Kac Formulae: Genealogical and Interacting Particle Systems [This looks really, really cool]
  • Janos Englander, "Branching diffusions, superdiffusions and random media", Probability Surveys 4 (2007): 303--364, arxiv:0710.0236
  • Benjamin Golub and Matthew O. Jackson, "Using selection bias to explain the observed structure of Internet diffusions", Proceedings of the National Academy of Sciences (USA) 107 (2010): 10833--10836
  • Vicenc Gomez, Hilbert J. Kappen and Andreas Kaltenbrunner, "Modeling the structure and evolution of discussion cascades", arxiv:1011.0673
  • P. Haccou et al., Branching Processes: Variation, Growth, and Extinction of Populations
  • Jose Luis Iribarren and Esteban Moro, "Branching Dynamics of Viral Information Spreading", Physical Review E 84 (2011): 046116
  • Predrag R. Jelenkovic, Jian Tan, "Modulated Branching Processes, Origins of Power Laws and Queueing Duality", 0709.4297
  • Junghyo Jo, Jean-Yves Fortin, M. Y. Choi, "Weibull-type limiting distribution for replicative systems", Physical Review E 83 (2011): 031123, arxiv:1103.3038
  • Jean-Francois Le Gall, Spatial Branching Processes, Random Snakes and Partial Differential Equations
  • Brendan P. M. McCabe, Gael M. Martin, David Harris, "Efficient probabilistic forecasts for counts", Journal of the Royal Statistical Society B 73 (2011): 253--272
  • Sebastian Müller, "Strong recurrence for branching Markov chains", arxiv:0710.4651
  • Victor M. Panaretos, "Partially observed branching processes for stochastic epidemics", Journal of Mathematical Biology 54 (2007): 645--668
  • David Sankoff, "Branching Processes with Terminal Types: Application to Context-Free Grammars", Journal of Applied Probability 8 (1971): 233--240 [JSTOR]
  • D. Sornette and S. Utkin, "Limits of declustering methods for disentangling exogenous from endogenous events in time series with foreshocks, main shocks, and aftershocks", Physical Review E 79 (2009): 061110, arxiv:0903.3217

#

    Complex Networks

    Having written a whole pop-sci article about these things (see below), I won't explain them at all here. This notebook is more of a placeholder than usual.

    Stuff I should learn more about: structural complexity measures for graphs and ensembles of random graphs; Gibbs measures for equilibrium ensembles of graphs; Markovian graphs. Why does it seem like the edges are the important random variables, rather than the nodes?

    Data analysis in general and community discovery in particular get their own notebooks. So does the connection between network topology and synchronization. The "homophily or influence?" problem.

    See also: Biochemical Network Evolution; Ecology; Neuroscience; Signal Transduction, Gene Regulation and Control of Metabolism; Social Networks; Sociology of Science; Statistical Mechanics; Synchronization

    #

    Evolution (of Organisms)

    [A proper discussion of evolution will appear here Any Time Now.]

    Issues in evolution proper: adaptation; complexity; developmental constraints and the evolution of development; ecology and co-evolution; genetics; sociobiology (in non-human beasties; in human beings, and what exactly it can and cannot account for); units of selection controversies (genes [Dawkins, Maynard Smith, Williams] vs. gene-complexes [Lewontin, sorta] vs. organisms [Williams the first time around] vs. groups [Sober?]) and group-selection arguments (when can traits which benefit a higher level of selection at the expense of the lower ones evolve? Probably never; the higher level entities don't have enough coherence and persistence to act as replicators).

    Query: What is known about the asymptotic distribution of the population under (discrete-time) replicator dynamics? What if the space of types in the replicator dynamics is infinite-dimensional? Or the fitness function is subject to stochastic shocks? Or both? (This now has its own notebook.)

    Extensions of evolution: to brain function; to computer programming; to culture (memetics); to economics; to epistemology; to psychology.

    Mathematical modeling: classical population genetics à la Fisher, Haldane and Wright, and its extensions via dynamics; game theory à la John Maynard Smith. Connections to physics. Agent-based modeling.

    Challenges to neo-Darwinism: Here, as usual, my inclinations are conservative, in that I really don't see what's wrong with the orthodox theory. In any case, there don't seem to be any real alternatives yet advanced. (Neutral mutations by definition explain the origins of neither adaptations nor species.)