Tue, 24 Nov 2009

Mysteries

I've spun off the list of recommendations into a separate page. True crime goes here too, for want of a better spot.

#

Recommended Mystery Novels

Including true crime and spy stories, for lack of better places to put them.

See also: Mysteries to read.

#

Sat, 21 Nov 2009

Evolutionary Economics

See also Learning in Games; Memes; QWERTY. The connection to institutional economics is something I want to understand better, but then, I need to learn a lot more about institutionalism.

#

Evolution (of Organisms)

[A proper discussion of evolution will appear here Any Time Now.]

Issues in evolution proper: adaptation; complexity; developmental constraints and the evolution of development; ecology and co-evolution; genetics; sociobiology (in non-human beasties; in human beings, and what exactly it can and cannot account for); units of selection controversies (genes [Dawkins, Maynard Smith, Williams] vs. gene-complexes [Lewontin, sorta] vs. organisms [Williams the first time around] vs. groups [Sober?]) and group-selection arguments (when can traits which benefit a higher level of selection at the expense of the lower ones evolve? Probably never; the higher level entities don't have enough coherence and persistence to act as replicators).

Query: What is known about the asymptotic distribution of the population under (discrete-time) replicator dynamics? What if the space of types in the replicator dynamics is infinite-dimensional? Or the fitness function is subject to stochastic shocks? Or both? (This now has its own notebook.)

Extensions of evolution: to brain function; to computer programming; to culture (memetics); to economics; to epistemology; to psychology.

Mathematical modeling: classical population genetics à la Fisher, Haldane and Wright, and its extensions via dynamics; game theory à la John Maynard Smith. Connections to physics. Agent-based modeling.

Challenges to neo-Darwinism: Here, as usual, my inclinations are conservative, in that I really don't see what's wrong with the orthodox theory. In any case, there don't seem to be any real alternatives yet advanced. (Neutral mutations by definition explain the origins of neither adaptations nor species.)

#

The Left

The right-thinking people who move us forward.

Whigs, philosophes, Liberals, Radicals, Leftists, Progressives, etc. History and views; causes of our present decrepitude and ways to fix it.

See also: Cultural Criticism; Economics; the Enlightenment; Environmentalism; Feminism; Socialism; Revolutions and Revolutionaries; the Right; Unions; the Welfare State

#

Learning in Games

See also Collective Cognition; Evolutionary Economics; Machine Learning, Statistical Inference and Induction; the Minority Game; Sequential Decisions Under Uncertainty; Universal Prediction Algorithms

#

Economics

I have always felt a certain horror of political economists, since I heard one of them say that he feared the famine of 1848 in Ireland would not kill more than a million people, and that would scarcely be enough to do much good.
---Attrib. to Benjamin Jowett

History of markets. QWERTY. Other parts of evolutionary economics. Arrow-Debreu model of general equilibrium. Market failure. Input-output models. Planning and regulation, esp. during the World Wars. Economic policy of US, Europe, East Asia, 3rd World. Development economics, growth theory. Corporations. Finance. Claims of analogies between physics and conventional economics (whether with laudatory or debunking intent) --- not to be confused with applying methods of theoretical physics to economic questions ("econophysics"). Agent-based modeling.

See also decision theory; historical materialism; information economy; institutional economics; socialism, market socialism

#

Thu, 19 Nov 2009

Causality and Causal Inference

There is unfortunately no accepted name for the scientific study of causality, or of methods for inferring it. "Etiology" suggests itself, but it's already taken...

Things I need to learn more about: Matched sampling methods.

See also: Computational Mechanics; Graphical Models; Machine Learning, Statistical Inference, and Induction

#

Graphical Models

A.k.a. causal models, causal graphs, Bayes graphs, Bayes networks, Bayesian networks. (Here "Bayes" is a metonym for "conditional probability". There are perfectly good frequentist interpretations of these models.) I'm sticking latent-variable and path-analysis models in here, too, because they all pretty much work the same way.

Everyone who takes basic statistics has it drilled into them that "correlation is not causation." (When I took psych. 1, the professor said he hoped that, if he were to come to us on our death-beds and prompt us with "Correlation is," we would all respond "not causation.") This is a problem, because one can infer correlation from data, and would like to be able to make inferences about causation. There are typically two ways out of this. One is to perform an experiment, preferably a randomized double-blind experiment, to eliminate accidental sources of correlation, common causes, etc. That's nice when you can do it, but impossible with supernovae, and not even easy with people. The other out is to look for correlations, say that of course they don't equal causations, and then act as if they did anyway. The technical names for this latter course of action are "linear regression" and "analysis of variance," and they form the core of applied quantitative social science, e.g., The Bell Curve.

Graphical models are, in part, a way of escaping from this impasse.

The basic idea is as follows. You have a bunch of variables, and you want to represent the causal relationships, or at least the probabilistic dependencies, between them. You do so by means of a graph. Each node in the graph stands for a variable. If variable A is a cause of B, then an arrow runs from A to B. If A is a cause of B, we also say that A is one of B's parents, and B one of A's children. If there is a causal path from A to B, then A is an ancestor of B, and B is a descendant of A. If a variable has no parents in the graph, it is exogenous, otherwise it is endogenous.

Part of what we mean by "cause" is that, when we know the immediate causes, the remoter causes are irrelevant --- given the parents, remoter ancestors don't matter. The standard example is that applying a flame to a piece of cotton will cause it to burn, whether the flame came from a match, spark, lighter or what-not. Probabilistically, this is a conditional independence property, or a Markov property: a variable is independent of its ancestors conditional on its parents. In fact, given its parents, its children, and its children's other parents, a variable is conditionally independent of all other variables. This is called the graphical or causal Markov property. When this holds, we can factor the joint probability distribution for all the variables into the product of the distribution of the exogenous variables, and the conditional distribution for each endogenous variable given its parents.
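As a small illustration of the factorization just described, here is a minimal Python sketch for a hypothetical three-variable chain A -> B -> C with binary variables. The probability tables and function names are made up for the example; nothing here comes from a particular library. It builds the joint distribution from the exogenous distribution and the conditionals, then checks that C is independent of its ancestor A once its parent B is given.

```python
from itertools import product

# Hypothetical chain A -> B -> C; all numbers are invented for illustration.
p_a = {0: 0.7, 1: 0.3}                                      # exogenous: P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # P(B | A), keyed [a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}    # P(C | B), keyed [b][c]

def joint(a, b, c):
    """Markov factorization: P(a, b, c) = P(a) * P(b | a) * P(c | b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

def cond_c_given_ab(c, a, b):
    """P(C = c | A = a, B = b), computed from the joint distribution alone."""
    return joint(a, b, c) / sum(joint(a, b, k) for k in (0, 1))

# The factorized joint is a proper distribution.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))

# Given the parent B, the remoter ancestor A makes no difference to C.
screened_off = all(
    abs(cond_c_given_ab(c, 0, b) - cond_c_given_ab(c, 1, b)) < 1e-12
    for b, c in product((0, 1), repeat=2)
)
```

The same pattern extends to any acyclic graph: one table per node, indexed by the values of that node's parents.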

(You may be wondering what happens if A is a parent of B and B is a parent of A, as can happen when there is feedback between the variables. This leads to difficulties, traditionally dealt with by explicitly limiting the discussion to acyclic graphs. I shall follow this wise precedent here.)

Now, there are certain rules which let us infer conditional independence relations from each other. For instance, if X is independent of the combination of Y and W, given Z, then X is independent of Y alone given Z. So, if we have a graph which obeys the causal Markov condition, there are generally other conditional independence relations which follow from the basic ones. If these are the only conditional independences which hold in the distribution, it is said to be faithful to the graph (or vice versa); otherwise it is unfaithful. For a graph to be Markov and unfaithful, there must (as it were) be an elaborate conspiracy among the conditional distributions, so elaborate that it will generally be destroyed by any change in any of those distributions. So faithfulness is a robust property.

This may sound pretty arcane, but that's just because it is arcane. The point, however, is that if you can make the three assumptions above (no causal cycles, Markov property, faithfulness), you're in business in a really remarkable way. There are very powerful statistical techniques that will let you infer the causal structure connecting your variables. This comes in two flavors. One is the Bayesian way: cook up a prior distribution over all possible causal graphs; compute the likelihood of the data under each graph; update your distribution over graphs; iterate. This is generally computationally intractable, assuming you can come up with a meaningful prior in the first place. The other approach is to use tests for conditional independence to eliminate possible connections between variables, and so to narrow down the range of candidate structures; it is basically frequentist, and can be shown, under a broad range of circumstances, to be asymptotically reliable.
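The second, test-based approach can be sketched in a few lines of Python. This is only a toy under strong assumptions that are not in the text above: the variables are linear-Gaussian, conditional independence is checked with a Fisher z-test on partial correlations, and only conditioning sets of size zero or one are tried. The function names, the threshold, and the simulated chain X -> Y -> Z are all hypothetical choices for the example, and the sketch recovers only the undirected skeleton, not the arrow directions.

```python
import math
import random
from itertools import combinations

def corr(u, v):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def partial_corr(data, i, j, k=None):
    """Correlation of i and j, optionally given one conditioning variable k."""
    r_ij = corr(data[i], data[j])
    if k is None:
        return r_ij
    r_ik, r_jk = corr(data[i], data[k]), corr(data[j], data[k])
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def independent(data, i, j, k=None, z_crit=3.29):
    """Fisher z-test; z_crit = 3.29 is roughly a two-sided 0.1% level."""
    n = len(data[i])
    r = partial_corr(data, i, j, k)
    dof = n - 3 - (0 if k is None else 1)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(dof)
    return abs(z) < z_crit

def skeleton(data):
    """Start from the complete graph; drop an edge when some test finds independence."""
    names = sorted(data)
    edges = set(combinations(names, 2))
    for i, j in sorted(edges):
        others = [k for k in names if k not in (i, j)]
        if any(independent(data, i, j, k) for k in [None] + others):
            edges.discard((i, j))
    return edges

# Simulated chain X -> Y -> Z (coefficients are arbitrary).
rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * a + rng.gauss(0, 1) for a in x]
z = [0.8 * b + rng.gauss(0, 1) for b in y]
data = {"X": x, "Y": y, "Z": z}
edges = skeleton(data)
```

In the simulated chain, X and Z are correlated marginally but not given Y, so the X-Z edge is the one the tests are meant to eliminate.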

Once you have your causal graph --- whether through estimation or through simply being handed one --- you can do lots of great things with it, like predict the effects of manipulating some of the variables, or make backward inferences from effects to causes. Of course, if the graph is big, doing the necessary calculations can be very troublesome in itself, and so people work on approximation methods and even ways of doing statistical inference on models of statistical distributions...

It's probably obvious I think this is incredibly neat, and even one of the most important ideas to come out of machine learning. Of course it doesn't really solve the problem of establishing causal relations, in the way Hume objected to; it says, assuming there are causal relations, of a certain stochastic form, and that these are stable, then they can be learned. But that, and the more general questions of what we ought to mean by "cause", deserve a notebook of their own.

Things I want to understand better: frequentist inference procedures. Computational learning theory for graphical models (the paper by Janzing and Herrmann is good). How to treat systems with feedback? How to treat dynamical systems and time series? How does all of this fit together with computational mechanics?

Not even a conjecture. Back in the 1960s, Chow and Liu (reference below) gave a polynomial algorithm for finding the best approximation to a global joint probability distribution using only pairwise interactions among the variables, i.e., the one which minimized the Kullback-Leibler divergence between the true and the approximating distribution. I have read that extending this to even three-way interactions is NP, though I don't know if it's NP-complete. (1) How is the intractability result established? (2) Is this the same as the computational phase transition one finds in going from 2-SAT to 3-SAT, where the critical point is at two-point-something SAT? (Presumably the answer to (1) would shed some light on this.) (3) Even if not, is there an analogous phase transition, perhaps in a different universality class? (Update in 2009, several years later: Bento and Montanari, below, sounds relevant, but I haven't read it yet.)
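For concreteness, the Chow-Liu procedure amounts to finding a maximum-weight spanning tree over the variables, with each pair weighted by its mutual information. The Python sketch below is one minimal way to do that, using plug-in mutual-information estimates from samples and Kruskal's algorithm. The function names and the simulated binary chain A -> B -> C -> D are assumptions made for the example.

```python
import math
import random
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y), in nats, from paired discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def chow_liu_tree(columns):
    """Maximum-weight spanning tree on pairwise mutual information (Kruskal)."""
    names = list(columns)
    weighted = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(names, 2)),
        reverse=True,
    )
    parent = {v: v for v in names}  # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for _, a, b in weighted:
        ra, rb = find(a), find(b)
        if ra != rb:          # adding this edge does not create a cycle
            parent[ra] = rb
            tree.append((a, b))
    return tree

# Simulated chain: each variable is a noisy copy of the previous one.
rng = random.Random(1)

def noisy_copy(bits, flip=0.1):
    return [b ^ (rng.random() < flip) for b in bits]

a = [rng.randrange(2) for _ in range(5000)]
b = noisy_copy(a)
c = noisy_copy(b)
d = noisy_copy(c)
tree = chow_liu_tree({"A": a, "B": b, "C": c, "D": d})
```

The cost is dominated by the pairwise mutual-information estimates, which is where the polynomial running time comes from.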

(Thanks to Gustavo Lacerda for pointing out a goof.)

#

Regression, especially Nonparametric Regression

"Regression", in statistical jargon, is the problem of guessing the average level of some quantitative response variable from various predictor variables.

Linear regression is perhaps the single most common quantitative tool in economics, sociology, and many other fields; it's certainly the most common use of statistics. (Analysis of variance, arguably more common in psychology and biology, is a disguised form of regression.) While linear regression deserves a place in statistics, that place should be nowhere near as large and prominent as it currently is. There are very few situations where we actually have scientific support for linear models. Fortunately, very flexible nonlinear regression methods now exist, and from the user's point of view are just as easy as linear regression, and at least as insightful. (Regression trees and additive models, in particular, are just as interpretable.) At the very least, if you do have a particular functional form in mind for the regression, linear or otherwise, you should use a non-parametric regression to test the adequacy of that form.

From a technical point of view, the main drawback of modern regression methods is that their extra flexibility comes at the price of less "efficiency" --- estimates converge more slowly, so you have less precision for the same amount of data. There are some situations where you'd prefer to have more precise estimates from a bad model than less precise estimates from a model which doesn't make systematic errors, but I don't think that's what most users of linear regression are choosing to do; they're just taught to type lm rather than gam. In this day and age, though, I don't understand why not.
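To make the contrast concrete, here is a minimal Python sketch comparing an ordinary least-squares line with a simple nonparametric smoother on data from a nonlinear curve. The smoother is a Nadaraya-Watson kernel regression, swapped in because it fits in a few lines; it is not the additive model that gam fits. The function names, the sine-curve data, and the bandwidth of 0.3 are all assumptions made for the example.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def kernel_smoother(xs, ys, bandwidth):
    """Nadaraya-Watson estimator with a Gaussian kernel; returns a prediction function."""
    def predict(x):
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs]
        return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
    return predict

# Simulated data: a sine curve plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 2 * math.pi) for _ in range(300)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]

linear = fit_line(xs, ys)
smooth = kernel_smoother(xs, ys, bandwidth=0.3)

# Mean squared error against the true curve on an evenly spaced grid.
grid = [2 * math.pi * i / 100 for i in range(101)]
linear_mse = sum((linear(x) - math.sin(x)) ** 2 for x in grid) / len(grid)
smooth_mse = sum((smooth(x) - math.sin(x)) ** 2 for x in grid) / len(grid)
```

The line's error here is almost entirely systematic, which is the kind of error that more data does not cure.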

(Of course, for the statistician, a lot of the more flexible regression methods look more or less like linear regression in some disguised form, because fundamentally all it does is projection. So it's not crazy to make it a foundational topic for statisticians. We should not, however, give the rest of the world the impression that the hat matrix is the source of all knowledge.)

The use of regression, linear or otherwise, for causal inference, rather than prediction, is a different, and far more sordid, story.

See also: Computational Statistics; Data Mining; Learning Theory; Model Selection; Neural Nets; Social Science Methodology; What Is the Right Null Model for Linear Regression?

#

Time Series, or Statistics for Stochastic Processes and Dynamical Systems

Rates of convergence of estimators; confidence intervals, analogs to VC-dimension results (see Meir's paper below). Large deviation techniques; why are large deviation rate functionals, when they exist, generally relative entropies? Prediction schemes. Are there universal schemes which do not demand exponentially growing volumes of data? Can any of the "universal algorithm" schemes actually be used for anything?

If you have an ergodic process, then the sample-path mean for any nice statistic you care to measure will, almost surely, converge to the distributional mean. This is even true of trajectory probabilities (i.e., if you want to know the probability of a certain finite-length trajectory, simply count how often it happens.) So "sit and count" is a reliable and consistent statistical procedure. If the process mixes sufficiently quickly, the rate of convergence might even be respectable. But this doesn't say anything about the efficiency of such procedures, which is surely a consideration. And what do you do for non-ergodic processes? (Take multiple runs and hope they're telling you about different ergodic components?) Non-stationary, even?
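A minimal Python sketch of "sit and count", assuming a hypothetical two-state Markov chain with made-up transition probabilities: simulate one long sample path, count how often a given finite trajectory occurs, and compare the count with the probability computed from the stationary distribution. The function names and parameters are illustrative only.

```python
import random

def simulate_chain(n_steps, p01, p10, rng):
    """Two-state chain: p01 = P(0 -> 1), p10 = P(1 -> 0). Starts in state 0."""
    state, path = 0, []
    for _ in range(n_steps):
        path.append(state)
        flip = p01 if state == 0 else p10
        if rng.random() < flip:
            state = 1 - state
    return path

def word_frequency(path, word):
    """Fraction of length-k windows of the path that equal the given trajectory."""
    word = tuple(word)
    k = len(word)
    windows = len(path) - k + 1
    hits = sum(1 for t in range(windows) if tuple(path[t:t + k]) == word)
    return hits / windows

p01, p10 = 0.3, 0.2
path = simulate_chain(200_000, p01, p10, random.Random(7))
estimate = word_frequency(path, (0, 1, 1))

# Exact value under stationarity: pi(0) * P(0 -> 1) * P(1 -> 1).
pi0 = p10 / (p01 + p10)
exact = pi0 * p01 * (1 - p10)
```

How many steps are needed for a given accuracy depends on how fast the chain mixes, which is the efficiency question raised above.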

I need to learn more about frequency-domain approaches; despite being raised as a physicist, I find the time domain much more natural. After all, the frequency domain is effectively just one choice of a function basis, and there are infinitely many others, which might in some sense be more appropriate to the process at hand. But that's at least in part a rationalization against having to learn more math.

LSE econometrics and its "general-to-specific" modeling procedure is very interesting, and I think possibly even related to stuff I've done, but I need to understand it much better than I do.

(This notebook probably needs subdivision.)

See also: Control Theory; Dynamical Systems; Ergodic Theory; Filtering, State Estimation and Signal Processing; Grammatical Inference; Information Theory; Machine Learning, Statistical Inference and Induction; Markov Models and Hidden Markov Models; Neural Coding; Power Law Distributions, 1/f Noise and Long-Memory Processes; Recurrence Times of Stochastic Processes (also Hitting, Waiting, and First-Passage Times); Sequential Decisions Under Uncertainty; State-Space Reconstruction; Statistical Learning Theory with Dependent Data; Statistics; Stochastic Processes; Symbolic Dynamics; Universal Prediction Algorithms

#

Model Selection

(Reader, please make your own suitably awful pun about the different senses of "model selection" here, as a discouragement to those finding this page through prurient searching. Thank you.)

In statistics and machine learning, "model selection" is the problem of picking among different mathematical models which all purport to describe the same data set. This notebook will not (for now) give advice on it; as usual, it's more of a place to organize my thoughts and references...

Classification of approaches to model selection (probably not really exhaustive but I can't think of others, right now):

Direct optimization of some measure of goodness of fit or risk on training data.
Seems implicit in a lot of work which points to marginal improvements in "the proportion of variance explained", mis-classification rates, "perplexity", etc. Often, also, a recipe for over-fitting and chasing snarks. What's wanted is (almost always) some way of measuring the ability to generalize to new data, and in-sample performance is a biased estimate of this. Still, with enough data, if the gods of ergodicity are kind, in-sample performance is representative of generalization performance, so perhaps this will work asymptotically, though in many cases the researcher will never even glimpse Asymptopia across the Jordan.
Optimize fit with model-dependent penalty
Add on a term to each model which supposedly indicates its ability to over-fit. (Adjusted R^2, AIC, BIC, ..., all do this in terms of the number of parameters.) Sounds reasonable, but I wonder how many actually work better, in practice, than direct optimization. (See Domingos for some depressing evidence on this score.)
Classical two-part minimum description length methods were penalties; I don't yet understand one-part MDL.
Penalties which depend on the model class
Measure the capacity of a class of models to over-fit; penalize all models in that class accordingly, regardless of their individual properties. Outstanding example: Vapnik's "structural risk minimization" (provably consistent under some circumstances). Only sporadically coincides with *IC-type penalties based on the number of parameters.
Cross-validation
Estimate the ability to generalize to different data by, in fact, using different data. Maybe the "industry standard" of machine learning. Query, how are we to know how much different data to use?
Query, how are we to cross-validate when we have complex, relational data? That is, I understand how to do it for independent samples, and I even understand how to do it for time series, but I do not understand how to do it for networks, and I don't think I am alone in this. (Well, I understand how to do it for Erdos-Renyi networks, because that's back to independent samples...)
The method of sieves
Directly optimize the fit, but within a constrained class of models; relax the constraint as the amount of data grows. If the constraint is relaxed slowly enough, should converge on the truth. (Ordinary parametric inference, within a single model class, is a limiting case where the constraint is relaxed infinitely slowly, and we converge on the pseudo-truth within that class [provided we have a consistent estimator].)
Encompassing models
The sampling distribution of any estimator of any model class is a function of the true distribution. If the true model class has been well-estimated, it should be able to predict what other, wrong model classes will estimate, but not vice versa. In this sense the true model class "encompasses the predictions" of the wrong ones. ("Truth is the criterion both of itself and of error.")
General or covering models
Come up with a single model class which includes all the interesting model classes as special cases; do ordinary estimation within it. Getting a consistent estimator of the additional parameters this introduces is often non-trivial, and interpretability can be a problem.
Model averaging
Don't try to pick the best or correct model; use them all with different weights. Choose the weighting scheme so that if one is best, it will tend to be more and more influential. Often I think the improvement is not so much from using multiple models as from smoothing, since estimates of the single best model are going to be more noisy than estimates of a bunch of models which are all pretty good. (This leads to ensemble methods.)
Adequacy testing
The correct model should be able to encode the data as uniform IID noise. Test whether "residuals", in the appropriate sense, are IID uniform. Reject models which can't hack it. Possibly none of the models on offer is adequate; this, too, is informative. Or: models make specific probabilistic assumptions (IID Gaussian noise, for example); test those. Mis-specification testing.
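Two of the approaches in the list above, direct optimization of in-sample fit and cross-validation, can be set side by side in a short Python sketch. It assumes independent samples, a Gaussian-kernel smoother as the model family, and a bandwidth as the thing being selected; the data, the candidate bandwidths, and all function names are invented for the illustration.

```python
import math
import random

def kernel_predict(train, x, bandwidth):
    """Gaussian-kernel weighted average of training responses at x."""
    weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi, _ in train]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed: fall back to the training mean
        return sum(yi for _, yi in train) / len(train)
    return sum(w * yi for w, (_, yi) in zip(weights, train)) / total

def mse(train, test, bandwidth):
    """Mean squared prediction error on `test` for a smoother fit to `train`."""
    return sum((y - kernel_predict(train, x, bandwidth)) ** 2 for x, y in test) / len(test)

def cv_error(data, bandwidth, k=5):
    """k-fold cross-validation; data are IID here, so strided folds are random folds."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i, held_out in enumerate(folds):
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        errors.append(mse(train, held_out, bandwidth))
    return sum(errors) / k

# Simulated data: a sine curve plus Gaussian noise.
rng = random.Random(3)
data = []
for _ in range(200):
    x = rng.uniform(0, 2 * math.pi)
    data.append((x, math.sin(x) + rng.gauss(0, 0.3)))

bandwidths = [0.01, 0.3, 3.0]
in_sample = {h: mse(data, data, h) for h in bandwidths}
cross_val = {h: cv_error(data, h) for h in bandwidths}

best_in_sample = min(bandwidths, key=in_sample.get)  # rewards near-interpolation
best_cv = min(bandwidths, key=cross_val.get)         # rewards generalization
```

In-sample error keeps falling as the bandwidth shrinks, since each point is increasingly predicted by itself, while the held-out error does not. The sketch says nothing about the harder cases raised above, such as time series or network data.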

The machine-learning-ish literature on model selection doesn't seem to ever talk about setting up experiments to select among models; or do I just not read the right papers there? (The statistical literature on experimental design tends to talk about "model discrimination" rather than "model selection".)

#

Ensemble Methods in Machine Learning

Boosting, bagging, binning, stacking, mixtures of experts, ...

Value of diversity.

See also: Collective Cognition; Learning Theory; Model Selection

#

Sat, 14 Nov 2009

Assortative Social Networks and Neutral Cultural Evolution

It is a common-place observation that there are strong relationships between cultural traits and social attributes; that different social groups accept and transmit different bits of culture. Most attempts to explain this from within the social sciences (emphatically including historical materialism and its variants) argue that this is due to some causal influence of social organization on culture. ("Social being determines consciousness" --- or, once the Hegelian gas has been released, social life shapes thought.) In these views, culture varies with social position because it's an adaptation to social position, or a reflection thereof, or an expression thereof. However, it is not clear to me that this can only be explained by a causal linkage.

A simple model to test this would be as follows. Imagine a population where each individual has a couple of social traits, which can take discrete values, and a cultural trait, which can likewise take a number of discrete values. Social traits are fixed. Now form a social network that's assortative, i.e., two individuals are more likely to be directly linked the more social traits they have in common. The cultural trait is variable over time. We start with some initial random distribution, but then, at each point in time, randomly pick one individual, who randomly copies one of their neighbors. Thus, culture is completely socially-neutral, and every cultural trait is just as well adapted as every other. My prediction is that, for reasonable-looking assortative networks, we'll see a good degree of correlation between social and cultural traits, just because people will be mostly learning from those close to them socially.

A slight refinement would be to make people uniformly more likely to adopt certain values of the cultural trait than others, independently of their social position. Then I predict that the less-popular cultural values will be concentrated in the smaller sub-networks.

(One could argue that this is still "social position" shaping thought, namely one's position in the social network. But now network structure screens-off and renders causally irrelevant the content of that social position.)

I hasten to add that such a model would be perfectly compatible with the pious hope that people have good reasons for their actions and beliefs; all that's really assumed is that there is no systematic relation between those reasons and social position. (So I'm not denying agency, rationality, etc.)

Needless to say, this would massively complicate the interpretation of opinion surveys. The typical practice of regressing responses on attributes of the responders will give you results which are weird hybrids of actual links between social status and beliefs, and the residue of diffusion.

The day after writing this, I found Hidalgo, Claro and Marquet's "Simple Dynamics on Complex Networks" (cond-mat/0411295). This looks at exactly the kind of random copying dynamic I have in mind, but divides the network into "guilds", in which all members have the same in-degree. Their surprising (to me) result is that, in equilibrium, the distribution of states (i.e., cultural traits) has to be the same for all guilds. However, their guilds do not, in general, correspond to socially-defined groups, so I still have some hope my intuition is not totally and completely wrong.

Update, 21 March 2005: I should also mention (now that I've read it) V. Sood and S. Redner, "Voter Model on Heterogeneous Graphs", cond-mat/0412599 (= PRL 94 (2005): 178701). This paper's starting point is the easily-seen fact that, under the pure case of the copy-a-random-neighbor dynamics I'm considering (and which is one of several very different things called "the voter model"), everyone must come to share the same opinion. That is, the consensus states are absorbing states. Sood and Redner try to calculate the mean time to consensus as a function of properties of the social network. This is going to be useful to me, but it's not quite the same thing.

While I'm updating this, I should maybe expand on what I hinted at above, about network structure "screening off" social status from cultural traits. There are several ways of expressing this formally, but the one I have in mind relies on our ability to decompose networks into "communities", sub-networks whose members are more closely tied to one another than to outsiders. (There are many ways of doing that, too, but I like the Newman-Girvan approach, not just because Mark is a good friend whom I can persuade to share code, but also because their algorithms make sense.) So, formally, what I'm proposing is that the dynamics I'm considering will (1) lead to strong statistical dependence between social position and cultural traits, but (2) social position and cultural traits will be (nearly) independent, conditional on community membership. (These statistical dependencies can be measured in any convenient way, e.g. through mutual information, or perhaps chi-squared to get p-values.) Of course, in the pure-copying case, this will be a transient effect, since ultimately everyone will share the same opinion. One thing I'm not sure of yet is whether it's better to just look at the transients (which Sood and Redner indicate might be very long), or to introduce some amount of perturbation (e.g., through copying errors) which will lead to a non-trivial statistical equilibrium. Maybe I should just try both and see.

22 April 2005: In conversation, Eric Smith suggests that Bill Labov's work on phonological changes in American English might have enough data to actually test such a neutral model.

Update, 16 October 2007: It works. Two social types (equiprobably), binary cultural trait (initially equiprobable). Nodes form ties with probability p if they are of the same type and probability q if they are of different types. Cultural traits change by random copying, as outlined above. I've plotted the chi-squared statistic for the association between social type and cultural trait as a function of time. The black line is a run where p=0.09, q=0.01, and the assortativity coefficient of the resulting network was r = 0.80. The grey line is a run where p=q=0.05, giving a graph with r = 0.045.
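The setup described in this update can be sketched in Python as follows. This is a reconstruction from the verbal description, not the code behind the plot: the population size, the number of steps, the recording interval, and every function name are assumptions, and only the tie probabilities p and q come from the text.

```python
import random

def make_network(n, p, q, rng):
    """Two equiprobable social types; tie probability p within a type, q across types."""
    types = [rng.randrange(2) for _ in range(n)]
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < (p if types[i] == types[j] else q):
                neighbors[i].append(j)
                neighbors[j].append(i)
    return types, neighbors

def chi_squared(types, traits):
    """Pearson chi-squared statistic for the 2x2 table of social type by cultural trait."""
    counts = [[0, 0], [0, 0]]
    for s, k in zip(types, traits):
        counts[s][k] += 1
    (a, b), (c, d) = counts
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column, e.g. after consensus
        return 0.0
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom

def run(n=200, p=0.09, q=0.01, steps=20000, record_every=200, seed=0):
    """Random-copying dynamics; returns types, adjacency lists, and the chi-squared history."""
    rng = random.Random(seed)
    types, neighbors = make_network(n, p, q, rng)
    traits = [rng.randrange(2) for _ in range(n)]  # initially equiprobable
    history = []
    for t in range(steps):
        i = rng.randrange(n)
        if neighbors[i]:  # isolated nodes have nobody to copy
            traits[i] = traits[rng.choice(neighbors[i])]
        if t % record_every == 0:
            history.append(chi_squared(types, traits))
    return types, neighbors, history

types, neighbors, history = run()

# Fraction of ties joining nodes of the same social type (a crude assortativity check).
ties = [(i, j) for i, nbrs in enumerate(neighbors) for j in nbrs if i < j]
same_type_fraction = sum(types[i] == types[j] for i, j in ties) / len(ties)
```

Calling run(p=0.05, q=0.05) gives the non-assortative comparison case, and plotting the two histories against time would be the analogue of the black and grey lines.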

#

Social Networks

See also: Community Discovery; Complex Networks; Institutions and Organizations; Network Data Analysis; Networks of Political Actors; Sociology; Sociology of Science; Terrorism

#

Sociology of Science

Raymond Aron says somewhere that "science is inseparable from the republic of scholars." This is substantially true, though I can imagine odd exceptions. (R. Crusoe, FRS, could have done astronomy or botany or algebra before meeting Friday, though I don't think he could have invented them.) In any event, science is an activity which groups do vastly better, and easier, than isolated individuals. In saying this, I trust I shan't have to defend myself against suspicion of social-constructionist heresy. The practical recognition of this truth goes back to the founders of the first academies during the scientific revolution, and it was explicitly recognized in the Enlightenment, for instance in d'Alembert's "Preliminary Discourse" to the Encyclopedie. An investigation into science which doesn't recognize, and account for, its social nature is on all fours with one which doesn't recognize, and account for, the fact that it produces reliable knowledge, which is to say much like an investigation of agriculture which doesn't realize it produces food. These should be "every schoolchild knows" truths, though sadly they're anything but.

Every schoolchild also knows that differences in social organization don't completely explain why statistical mechanics is fruitful, but UFOlogy is not — that matter really is made out of molecules, and people really aren't abducted by aliens, has, to say the least, something to do with it. But the sciences started from beliefs about as wacko as anything today's kooks can produce — say, alchemy — but haven't stayed there, whereas the kooks have, and this deserves explanation. More: a proper understanding of this could help improve scientific method, something eagerly to be desired.

Of course there are already lots of people engaged in this undertaking; sociology of science is, in general, more sensible than most scientists suppose. (Also more sensible than most of the rest of sociology, but that's another story for another time.) Even the noise in the management literature recently about "learning organizations" and the like is not unrelated, and might even be promising. (On the one hand, lots of problems get cracked once people see that lots of money could be made from the solution. On the other hand, we are talking about the management witch-doctors.) There are, however, two potentially fruitful lines of research which nobody, so far as I know, has bothered to undertake. One is straightforward comparative sociology, contrasting genuine intellectual disciplines (including, besides the natural sciences, things like history or philology) with the half-disciplines, the pseudosciences, and the simple crackpots. The other is to take some of the descriptions of how scientists act and interact with each other from the existing sociological literature, throw them on the computer, and see if they produce something which looks like the science we know; also if they produce the results their authors claim they do. (My suspicion is that most of them will not.)

See also: Collective Cognition; Evolutionary Epistemology; History of Science; Science; Scientific Method

#

Community Discovery Methods for Complex Networks

Given: a network, especially a large one, directed or not, weighted or not. Desired: a sensible decomposition of the graph into sub-graphs, where in some reasonable sense the nodes in each sub-graph have more to do with each other than with outsiders, i.e., form communities. This is also called "module detection".

This seems like a really useful idea to apply to problems I'm interested in, in neural synchronization; also a place where there could stand to be more interchange between statistics and complex-network-wallahs.

Some of the methods in this area remind me of stuff Christopher Alexander did in his 1964 book Notes on the Synthesis of Form, but it's been a long time since I read that, so my memory may be faulty.

See also: Ecology; Neuroscience; Signal Transduction, Gene Regulation and Control of Metabolism; Social Networks; Sociology of Science; Statistical Mechanics; Synchronization

#

The United States Congress, How It Works and For Whom

I'm starting to do research on this, Heaven help me, as an exercise in network analysis. See also Campaign Finance; Democracy; Political Elites; Networks of Political Actors; Political Decision Making

#

Networks of Political Actors

One of the things I'm interested in is understanding how network forms of organization emerge among political actors, how they affect decision-making, and how they interact with other social networks and institutions. I have a ridiculously over-ambitious research project, about networks of cronyism, that I'd like to do, but in the meantime I'm settling for small steps. Presumably, like other social networks, political networks serve as platforms for information exchange, deliberation, and other forms of collective cognition. Formal political organizations can also serve these functions, but it seems easier to make organizations democratically accountable than it is networks --- is this a problem? How does their structure compare to that of other networks?

#

Fri, 13 Nov 2009

Phase Transitions and Critical Phenomena

One of the central areas of statistical mechanics for the last, oh, forty years, to the point where it has seriously shaped --- one might even say, warped --- how those of us trained in that tradition look at the world in general. (See power laws and especially self-organized criticality.)

Things I want to understand better. Rigorously separated phases seem to only exist in infinite-system limits; what are the large-but-finite regimes like? Connections between phase transitions and changes in the topology of the phase space. Do there exist ways of deducing the order parameter from either microscopic Hamiltonians or from macroscopic observations? Is there a way of detecting phase transitions from macroscopic observables other than the order parameter and the thermodynamic potential?

Why are there so few fixed points to the renormalization group?

Connections between power law distributions and critical fluctuations. While I understand the physical arguments for why we see power-law-distributed fluctuations at the critical point, I find myself wanting a more probabilistic explanation as well. A crude sketch would go as follows. Far from the critical point, the microscopic dynamics are rapidly mixing in space and time --- and mixing in the technical, ergodic theory sense, so that the central limit theorem applies, and averages over spatio-temporal regions large compared to the mixing scales are approximately Gaussian. (Cf. Rosenblatt, 1956.) As one approaches the critical point, however, giant, correlated fluctuations begin to appear, i.e., the mixing scales diverge, and one is dealing with a process with long-range memory (in both space and time). Under these circumstances, averaging can deliver a non-Gaussian but still self-similar distribution, which is where the power-law tails come from. The stable distributions, including the Gaussian, emerge from the central limit theorem for independent variables because they are unchanged under convolution (averaging) with themselves --- there are ways, in renormalization group theory, of trading off infinite variance (as in the non-Gaussian stable limits) for infinite-range correlation. This, I should understand better. (The review paper by Jona-Lasinio is a start, but does not leave me with enough intuition that I feel entirely comfortable with what's going on --- in part, I think, because nobody entirely understands things.)

#

Foundations and History of Statistical Mechanics

Technical issues: things like, what exactly is a C* algebra? Role of large deviations.

Conceptual issues: Why is it legitimate to treat deterministic mechanical systems with many unstable degrees of freedom as stochastic processes? (My impulse is to appeal to ergodic theory.) When and why do we get convergence to equilibria characterized by only a few macroscopic degrees of freedom? (That sounds like a central limit theorem, some kind of result about how the large-scale limit is insensitive to all but a few aspects of the small scales.)

Historical issues: It's interesting to know how people have argued about this stuff.

See also: Statistical Mechanics; Nonequilibrium Statistical Mechanics; Maximum Entropy; Tsallis Statistics

#

Power Law Distributions, 1/f Noise, Long-Memory Time Series

Why do physicists care about power laws so much?

I'm probably not the best person to speak on behalf of our tribal obsessions (there was a long debate among the faculty at my thesis defense as to whether "this stuff is really physics"), but I'll do my best. There are two parts to this: power-law decay of correlations, and power-law size distributions. The link is tenuous, at best, but they tend to get run together in our heads, so I'll treat them both here.

The reason we care about power law correlations is that we're conditioned to think they're a sign of something interesting and complicated happening. The first step is to convince ourselves that in boring situations, we don't see power laws. This is fairly easy: there are pretty good and rather generic arguments which say that systems in thermodynamic equilibrium, i.e. boring ones, should have correlations which decay exponentially over space and time; the reciprocals of the decay rates are the correlation length and the correlation time, and say how big a typical fluctuation should be. This is roughly first-semester graduate statistical mechanics. (You can find those arguments in, say, volume one of Landau and Lifshitz's Statistical Physics.)

Second semester graduate stat. mech. is where those arguments break down --- either for systems which are far from equilibrium (e.g., turbulent flows), or in equilibrium but very close to a critical point (e.g., the liquid-gas critical point, or the transition from a non-magnetic phase to a magnetized one). At a critical point, correlations decay like power laws, and in many non-equilibrium systems they do too. (Again, for phase transitions, Landau and Lifshitz has a good discussion.) If you're a statistical physicist, phase transitions and non-equilibrium processes define the terms "complex" and "interesting" --- especially phase transitions, since we've spent the last forty years or so developing a very successful theory of critical phenomena. Accordingly, whenever we see power law correlations, we assume there must be something complex and interesting going on to produce them. (If this sounds like the fallacy of affirming the consequent, that's because it is.) By a kind of transitivity, this makes power laws interesting in themselves.

Since, as physicists, we're generally more comfortable working in the frequency domain than the time domain, we often transform the autocorrelation function into the Fourier spectrum. A power-law decay for the correlations as a function of time translates into a power-law decay of the spectrum as a function of frequency, so this is also called "1/f noise".

Similarly for power-law distributions. A simple use of the Einstein fluctuation formula says that thermodynamic variables will have Gaussian distributions with the equilibrium value as their mean. (The usual version of this argument is not very precise.) We're also used to seeing exponential distributions, as the probabilities of microscopic states. Other distributions weird us out. Power-law distributions weird us out even more, because they seem to say there's no typical scale or size for the variable, whereas the exponential and the Gaussian cases both have natural scale parameters. There is a connection here with fractals, which also lack typical scales, but I don't feel up to going into that, and certainly a lot of the power laws physicists get excited about have no obvious connection to any kind of (approximate) fractal geometry. And there are lots of power law distributions in all kinds of data, especially social data --- that's why they're also called Pareto distributions, after the sociologist.

Physicists have devoted quite a bit of time over the last two decades to seizing on what look like power-laws in various non-physical sets of data, and trying to explain them in terms we're familiar with, especially phase transitions. (Thus "self-organized criticality".) So badly are we infatuated that there is now a huge, rapidly growing literature devoted to "Tsallis statistics" or "non-extensive thermodynamics", which is a recipe for modifying normal statistical mechanics so that it produces power law distributions; and this, so far as I can see, is its only good feature. (I will not attempt, here, to support that sweeping negative verdict on the work of many people who have more credentials and experience than I do.) This has not been one of our more successful undertakings, though the basic motivation --- "let's see what we can do!" --- is one I'm certainly in sympathy with.

There have been two problems with the efforts to explain all power laws using the things statistical physicists know. One is that (to mangle Kipling) there turn out to be nine and sixty ways of constructing power laws, and every single one of them is right, in that it does indeed produce a power law. Power laws turn out to result from a kind of central limit theorem for multiplicative growth processes, an observation which apparently dates back to Herbert Simon, and which has been rediscovered by a number of physicists (for instance, Sornette). Reed and Hughes have established an even more deflating explanation (see below). Now, just because these simple mechanisms exist, doesn't mean they explain any particular case, but it does mean that you can't legitimately argue "My favorite mechanism produces a power law; there is a power law here; it is very unlikely there would be a power law if my mechanism were not at work; therefore, it is reasonable to believe my mechanism is at work here." (Deborah Mayo would say that finding a power law does not constitute a severe test of your hypothesis.) You need to do "differential diagnosis", by identifying other, non-power-law consequences of your mechanism, which other possible explanations don't share. This, we hardly ever do.
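
To see how little machinery a power law needs, here is a minimal sketch of the simplest version of the stopped-growth argument as I understand it from Reed and Hughes: something grows exponentially at rate mu and is looked at after an exponentially distributed random time with rate nu. The survival function is then exactly x^(-nu/mu), with no criticality in sight. The parameter values, the seed and the function name are arbitrary choices for illustration.

```python
import math
import random

def stopped_growth_sample(mu, nu, rng):
    """Size of a process growing like exp(mu * t), observed at a random
    time T that is exponentially distributed with rate nu."""
    t = rng.expovariate(nu)
    return math.exp(mu * t)

rng = random.Random(42)
mu, nu = 1.0, 2.0   # illustrative values only
n = 100_000
sizes = [stopped_growth_sample(mu, nu, rng) for _ in range(n)]

def empirical_survival(x):
    """Fraction of simulated sizes strictly larger than x."""
    return sum(s > x for s in sizes) / n

# Theory: P(X > x) = P(T > ln(x) / mu) = exp(-nu * ln(x) / mu) = x ** (-nu / mu)
checks = {x: (empirical_survival(x), x ** (-nu / mu)) for x in (2.0, 4.0, 8.0)}
```

Each entry of `checks` pairs the simulated tail probability with the power-law prediction. That they agree says nothing about whether this mechanism is at work in any real data set, which is exactly the point about differential diagnosis.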

Similarly for 1/f noise. Many different kinds of stochastic process, with no connection to critical phenomena, have power-law correlations. Econometricians and time-series analysts have studied them for quite a while, under the general heading of "long-memory" processes. You can get them from things as simple as a superposition of Gaussian autoregressive processes. (We have begun to awaken to this fact, under the heading of "fractional Brownian motion".)
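
A sketch of the superposition point, under assumptions chosen to make the arithmetic transparent: take many independent AR(1) processes, each rescaled to unit stationary variance, with coefficients spread evenly over (0, 1). Component i has autocorrelation a_i^k at lag k, so the sum has autocorrelation equal to the average of a_i^k, which approximates the integral of a^k over [0, 1], i.e. 1/(k+1). The even spacing and the number of components are assumptions made for illustration, not something taken from the long-memory literature.

```python
def mixture_autocorrelation(coeffs, lag):
    """Autocorrelation at `lag` of a sum of independent AR(1) processes,
    each scaled to unit stationary variance, so that component i
    contributes a_i ** lag to the autocovariance."""
    return sum(a ** lag for a in coeffs) / len(coeffs)

n_components = 1000
# Midpoints of an even grid on (0, 1): an assumed spread of AR coefficients.
coeffs = [(i + 0.5) / n_components for i in range(n_components)]

lags = (1, 10, 50)
mixture = {k: mixture_autocorrelation(coeffs, k) for k in lags}
single = {k: 0.9 ** k for k in lags}           # one AR(1) with a = 0.9
power_law = {k: 1.0 / (k + 1) for k in lags}   # integral of a ** k over [0, 1]
```

Over these lags the mixture tracks 1/(k+1) closely, while the single AR(1) process falls off exponentially. With finitely many components the decay must eventually turn exponential at lags comparable to the number of components, so this is a power law over a range, not all the way out.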

The other problem with our efforts has been that a lot of the power-laws we've been trying to explain are not, in fact, power-laws. I should perhaps explain that statistical physicists are called that, not because we know a lot of statistics, but because we study the large-scale, aggregated effects of the interactions of large numbers of particles, including, specifically, the effects which show up as fluctuations and noise. In doing this we learn, basically, nothing about drawing inferences from empirical data, beyond what we may remember about curve fitting and propagation of errors from our undergraduate lab courses. Some of us, naturally, do know a lot of statistics, and even teach it --- I might mention Josef Honerkamp's superb Stochastic Dynamical Systems. (Of course, that book is out of print and hardly ever cited...)

If I had, oh, let's say fifty dollars for every time I've seen a slide (or a preprint) where one of us physicists makes a log-log plot of their data, and then reports as the exponent of a new power law the slope they got from doing a least-squares linear fit, I'd at least not grumble. If my colleagues had gone to statistics textbooks and looked up how to estimate the parameters of a Pareto distribution, I'd be a happier man. If any of them had actually tested the hypothesis that they had a power law against alternatives like stretched exponentials, or especially log-normals, I'd think the millennium was at hand. (If you want to know how to do these things, please read this paper, whose merits are entirely due to my co-authors.) The situation for 1/f noise is not so dire, but there have been and still are plenty of abuses, starting with the fact that simply taking the fast Fourier transform of the autocovariance function does not give you a reliable estimate of the power spectrum, particularly in the tails. (On that point, see, for instance, Honerkamp.)
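
For concreteness, here is a minimal sketch of the textbook estimate alluded to above: the maximum-likelihood exponent for a continuous power law, p(x) proportional to x^(-alpha) for x >= xmin, which is alpha_hat = 1 + n / sum(log(x_i / xmin)), with approximate standard error (alpha_hat - 1) / sqrt(n). The sketch assumes xmin is known, which in real data it is not; the synthetic data, the seed and the function name are illustrative only. It estimates the exponent and does nothing to test the power law against log-normals or other alternatives.

```python
import math
import random

def pareto_mle(data, xmin):
    """Maximum-likelihood exponent for a continuous power law
    p(x) proportional to x ** (-alpha) on x >= xmin, with xmin treated as known.
    Returns (alpha_hat, approximate standard error)."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    alpha_hat = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    return alpha_hat, (alpha_hat - 1.0) / math.sqrt(n)

# Synthetic data (illustrative): inverse-transform sampling with alpha = 2.5.
rng = random.Random(0)
true_alpha, xmin = 2.5, 1.0
data = [xmin * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
        for _ in range(20_000)]

alpha_hat, std_err = pareto_mle(data, xmin)
```

No binning, no log-log plot and no regression line are involved, and the estimate comes with an honest standard error, which the slope of a least-squares fit to a histogram does not.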

See also: Chaos and Dynamical Systems; Complex Networks; Self-Organized Criticality; Time Series; Tsallis Statistics

#

Neural Modeling and Data Analysis

Especially, but not exclusively, modeling of spike trains (which is important for neural coding, and overlaps therewith).

Things to investigate: How easy would it be to adapt spike-sorting algorithms to cluster or classify other kinds of time series? Easy or not, would there be any point?

See also: Neural Coding; Synchronization in Neural Systems; Neuroscience in general

#

The Enlightenment

Voltaire, Diderot, Hume, La Mettrie, Smith, Gibbon. Origins of the revolution, the Left. Relations to science, superstition, Romanticism, the industrial revolution. Connections and attitudes to classical antiquity, the Renaissance.

#

Frequentist Consistency of Bayesian Procedures

"Bayesian consistency" is usually taken to mean showing that, under Bayesian updating, the posterior probability concentrates on the true model. That is, for every (measurable) set of hypotheses containing the truth, the posterior probability goes to 1. (In practice one shows that the posterior probability of any set not containing the truth goes to zero.) There is a basic result here, due to Doob, which essentially says that the Bayesian learner is consistent, except on a set of data of prior probability zero. That is, the Bayesian is subjectively certain they will converge on the truth. This is not as reassuring as one might wish, and showing Bayesian consistency under the true distribution is harder. In fact, it usually involves assumptions under which non-Bayes procedures will also converge. These are things like the existence of very powerful consistent hypothesis tests (an approach favored by Ghosal, van der Vaart, et al., supposedly going back to Le Cam), or, inspired by learning theory, constraints on the effective size of the hypothesis space which are gradually relaxed as the sample size grows (as in Barron et al.). If these assumptions do not hold, one can construct situations in which Bayesian procedures are inconsistent.

Concentration of the posterior around the truth is only a preliminary. One would also want to know that, say, the posterior mean converges, or even better that the predictive distribution converges. For many finite-dimensional problems, what's called the "Bernstein-von Mises theorem" basically says that the posterior mean and the maximum likelihood estimate converge, so if one works the other will too. This breaks down for infinite-dimensional problems.
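
The easiest possible case of posterior concentration can be sketched as follows, assuming IID coin tosses and a finite set of candidate biases which contains the truth. The three hypotheses, the uniform prior, the seed and the sample sizes are all made up for illustration. None of the difficulties discussed above (infinite-dimensional models, mis-specification, dependence) arise in this setting, so it shows what consistency looks like and nothing about when it fails.

```python
import math
import random

def posterior(hypotheses, prior, data):
    """Posterior over a finite set of Bernoulli success probabilities,
    computed in log space to avoid numerical underflow."""
    heads = sum(data)
    tails = len(data) - heads
    log_post = [math.log(w) + heads * math.log(p) + tails * math.log(1.0 - p)
                for p, w in zip(hypotheses, prior)]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)
hypotheses = [0.3, 0.5, 0.7]   # illustrative model class, truth included
prior = [1 / 3, 1 / 3, 1 / 3]
truth = 0.7
data = [1 if rng.random() < truth else 0 for _ in range(2000)]

# Posterior after 10, 100 and 2000 observations, plus the final posterior mean.
posteriors = {n: posterior(hypotheses, prior, data[:n]) for n in (10, 100, 2000)}
posterior_mean = sum(p * q for p, q in zip(hypotheses, posteriors[2000]))
```

Here the posterior mass on the true bias heads to 1 and the posterior mean follows it, because the log-likelihood ratio in favor of the truth grows linearly in the sample size.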

(PAC-Bayesian results don't fit into this picture particularly neatly. Essentially, they say that if you find a set of classifiers which all classify correctly in-sample, and ask about the average out-of-sample performance, the bounds on the latter are tighter for big sets than for small ones. This is for the unmysterious reason that it takes a bigger coincidence for many bad classification rules to happen to all work on the training data than for a few bad rules to get lucky. The actual Bayesian machinery of posterior updating doesn't really come into play.)

I believe I have contributed a Result to this area, on what happens when the data are dependent and all the models are mis-specified, but some are more mis-specified than others.

Query: are there any situations where Bayesian methods are consistent but no non-Bayesian method is? (My recollection is that John Earman, in Bayes or Bust, provides a negative answer, but I forget how.)

#

Democracy

And science. Export from Europe. Indigenous outside Europe? (Yes: see Muhlberg.) In tribal and especially in nomadic cultures (like the proto-Indo-Europeans)? And non-European philosophies. Pluralism, secularism, liberty. Representative and direct. And telecommunications. Democratic deliberation as a mechanism for collective cognition.

#

Tue, 10 Nov 2009

Statistics

An application of probability, with intimate ties to machine learning, non-demonstrative inference and induction.

Since June 2005, I have been a (very, very junior) professor of statistics. This made me interested in how to teach it.

See also: Properties vs. principles in defining "good statistics"

Things I need to learn more about:
Dependent data
Statistical inference for stochastic processes, a.k.a. time-series analysis. Signal processing and filtering. Spatial statistics.
Model selection
Gets its own notebook.
Adapting statistical procedures to data without losing validity
Sequential inference, adaptive sampling, bandwidth selection.
Model discrimination
That is, designing experiments so as to discriminate between competing classes of model. Adaptation to data issues here, too.
Rates of convergence of estimators to true values
Empirical process theory. (Cf. some questions in ergodic theory).
Estimating distribution functions
And estimating entropies, or other functionals of distributions.
Non-parametric methods
Both those that are genuinely distribution-free, and those that would more accurately be mega-parametric (even infinitely-parametric) methods, such as neural networks.
Regression
Resampling methods
Including distribution-free resampling methods, especially for dependent data.
Sufficient statistics
Get their own notebook.
Decision theory
Conventional, and the sorts with some connection to how real decisions are made.
Graphical models
Monte Carlo and other simulation methods
"De-Bayesing"
Ways of taking Bayesian procedures and eliminating dependence on priors, either by replacing them by initial point-estimates, or by showing the prior doesn't matter, asymptotically or hopefully sooner. See: Frequentist consistency of Bayesian procedures.
Information Geometry
Partial identification of parametric statistical models
Causal Inference
Computational Statistics
Statistics of structured data
Grammatical Inference