Archives
Categories
Afghanistan and Central Asia
Anticontrarianism Bayes, Anti-Bayes Biology Complexity Corrupting the Young Creationism Cthulhiana Dialogues Enigmas of Chance Food Friday Cat Blogging Heard About Pittsburgh, PA Incestuous Amplification IQ Islam Learned Folly Linkage Mathematics Minds, Brains, and Neurons Modest Proposals Networks Philosophy Physics Pleasures of Detection, Portraits of Crime Postcards Power Laws Psychoceramica Scientifiction and Fantastica Self-Centered The Beloved Republic The Collective Use and Evolution of Concepts The Commonwealth of Letters The Continuing Crises The Dismal Science The Eternal Silence of These Infinite Spaces The Great Transformation The Natural Science of the Human Species The Progressive Forces The Running-Dogs of Reaction Writing for Antiquity
Self-Centered
Personal homepage
Professorial homepage News and Blogroll Research Notebooks My books (LibraryThing) Buy me more books! (wishlist) pinboard (bookmarks) tumblr (images) flickr (snapshots) dopplr (travel)
Books to Read While the Algae Grow in Your Fur
Books (etc.) I've read this month and
feel I can recommend (warning: I have no taste)
|
May 21, 2012If Peer Review Did Not Exist, We Would Have to Invent Something Very Like It to Serve Highly Similar Ends
Attention conservation notice: 1400 words on a friend's proposal to do away with peer review, written many weeks ago when there was actually some debate about this. Larry is writing about peer review (again), this time to advocate "A World Without Referees". Every scientist, of course, has day-dreamed about this, in a first-lets-kill-all-the-lawyers way, but Larry is serious, so let's treat this seriously. I'm not going to summarize his argument; it's short and you can and should go read it yourself. I think it helps, when thinking about this, to separate two functions peer-reviewed journals and conferences have traditionally served. One is spreading claims (dissemination), and the other is letting readers know about claims worthy of their attention (certification). Arxiv, or something like it, can take over dissemination handily. Making copies of papers is now very cheap and very fast, so we no longer have to be choosy about which ones we disseminate. In physics, this use of Arxiv is just as well-established as Larry says. In fact, one reason Arxiv was able to establish itself so rapidly and thoroughly among physicists was that they already had a well-entrenched culture of circulating preprints long before journal publication. What Arxiv did was make this public and universally accessible. But physicists still rely on journals for certification. People pay more attention to papers which come out in Physical Review Letters, or even just Physical Review E, than ones which are only on Arxiv. "Could it make it past peer review?" is used by many people as a filter to weed out the stuff which is too wrong or too unimportant to bother with. This doesn't work so well for those directly concerned with a particular research topic, but if something is only peripherally of interest, it makes a lot of sense. Even within a specialized research community, consisting entirely of experts who can evaluate new contributions on their own, there is a rankling inefficiency to the world without referees. Larry talks about spending a minute or two looking at new stats. papers on Arxiv every day. But everyone filtering Arxiv for themselves is going to get harder and harder as more potentially-relevant stuff gets put on it. I'm interested in information theory, so I've long looked at cs.IT, and it's become notably more time-consuming as that community has embraced the Arxiv. Yet within any given epistemic community, lots of people are going to be applying very similar filters. So the world-without-referees has an increasing amount o work being done by individuals, but a lot of that work is redundant. Efficiency, the division of labor, points to having a few people put their time into filtering, and the rest of us relying on it, even when in principle we could do the filtering ourselves. To be fair, of course, we should probably take this job in turns... So: if all papers get put on Arxiv, filtering becomes a big job, so efficiency pushes us towards having only some members of the research community do the filtering for the rest. We have re-invented something very much like peer review, purely so that our lives are not completely consumed by evaluating new papers, and we can actually get some work done. Larry's proposal for a world without referees also doesn't seem to take into account the needs of researchers to rely on findings in fields in which they are not experts, and so can't act as their own filters. (Or they could if they put in a few years in something else first.) If I need some result from neuroscience, or for that matter from topology, I do not have the time to spend becoming a neuroscientist or topologist, and it is an immense benefit to have institutions I can trust to tell me "these claims about cortical columns, or locally compact Hausdorff spaces, are at least not crazy". This is also a kind of filtering, and there is the same push, based on the division of labor, to rely on only some neuroscientists or topologists to do the filtering for outsiders (or all of them only some of the time), and again we have re-created something very much like refereeing. So: some form or forms of filtering is inevitable, and the forces pushing for a division of labor in filtering are very strong. I don't know of any reason to think that the current, historically-evolved peer review system is the best way of organizing this cognitive triage, but we're not going to avoid having some such system, nor should we want to. Different ways of organizing the work of filtering will have different costs and benefits, but we should be talking about those and those trade-offs, not hoping that we can just wish the problem away now that making copies is cheap1. It's not at all obvious, for instance, that attention-filtering for the internal benefit of members of a research community should be done in the same way as reliability-filtering for outsiders. But, to repeat, we are going to have filters and they are almost certainly going to involve a division of labor. Lenin, supposedly, said that "small production engenders capitalism and the bourgeoisie daily, hourly, spontaneously and on a mass scale" (Nove, The Economics of Feasible Socialism Revisited, p. 46). Whether he was right about the bourgeoisie or not, the rate of production of the scientific literature, the similarity of interests and standards with a community, and the need to rely on other field's findings are all doing to engender refereeing-like institutions, "daily, hourly, spontaneously and on a mass scale". I don't think Larry would go to the same lengths to get rid of referees that Lenin went to get rid of the bourgeoisie, but in any case the truly progressive course is not to suppress the old system by force, but to provide a superior alternative. Speaking personally, I am attracted to a scenario we might call "peer review among consenting adults". Let anyone put anything on Arxiv (modulo the usual crank-screen). But then let others create filtered versions, applying such standards of topic, rigor, applicability, writing quality, etc., as they please --- and be explicit about what those standards are. These can be layered as deep as their audience can support. Presumably the later filters would be intended for those further from active research in the area, and so would be less tolerant of false alarms, and more tolerant of missing possible discoveries, than the filters for those close to the work. But this could be an area for experiment, and for seeing what people actually find useful. This is, I take it, more or less what Paul Ginsparg proposes, and it has a lot to recommend it. Every contribution is available if anyone wants to read it, but no one is compelled to try to filter the whole flow of the scholarly literature unaided, and human intelligence can still be used to amplify interesting signals, or even to improve papers. Attractive as I find this idea, I am not saying it is historically inevitable, or even the best possible way of ordering these matters. The main point is that peer review does some very important jobs for the community of inquirers (whether or not it evolved to do them), and that if we want to get rid of it, it would be a good idea to have something else ready to do those jobs. [1]: For instance, many people have suggested that referees should have to take responsibility, in some way, for their reports, so that those who do sloppy or ignorant or merely-partisan work will be at least shamed. There is genuinely a lot to be said for this. But it does run into the conflicting demand that science should not be a respecter of persons --- if Grand Poo-Bah X writes a crappy paper, people should be able to call X on it, without fear of retribution or considering the (inevitable) internal politics of the discipline and the job-market. I do not know if there is a way to reconcile these, but that's one of the kind of trade-offs we have to consider as we try to re-design this institution. ^ Learned Folly; Kith and Kin; The Collective Use and Evolution of Concepts Posted by crshalizi at May 21, 2012 02:00 | permanent link
May 03, 2012Ten Years of Monster Raving Egomania and Utter Batshit Insanity
Sometimes, all you can do is quote verbatim* from your inbox: Date: Tue, 17 Apr 2012 09:31:57 -0400 From: Stephen Wolfram To: Cosma Shalizi Subject: 10-year followup on "A New Kind of Science" Next month it'll be 10 years since I published "A New Kind of Science" ... and I'm planning to take stock of the decade of commentary, feedback and follow-on work about the book that's appeared. My archives show that you wrote an early review of the book: http://www.cscs.umich.edu/~crshalizi/reviews/wolfram/ At the time reviews like yours appeared, most of the modern web apparatus for response and public discussion had not yet developed. But now it has, and there seems to be considerable interest in the community in me using that venue to give my responses and comments to early reviews. I'm writing to ask if there's more you'd like to add before I embark on my analysis in the next week or so. I'd like to take this opportunity to thank you for the work you put into writing a review of my book. I know it was a challenge to review a book of its size, especially quickly. I plan to read all reviews with forbearance, and hope that---especially leavened by the passage of a decade---useful intellectual points can be derived from discussing them. If you don't have anything to add to your early review, it'd be very helpful to know that as soon as possible. Thanks in advance for your help. -- Stephen Wolfram P.S. Nowadays you can find the whole book online at http://www.wolframscience.com/nksonline/toc.html If you'd like a new physical copy, just let me know and I can have it shipped... I wrote my my review in 2002 (though I didn't put it out until 2005). The idea that complex patterns can arise from simple rules was already old then, and has only become more commonplace since. A lot of interesting, substantive, specific science has been done on that theme in the ensuing decade. To this effort, neither Wolfram nor his book have contributed anything of any note. The one respect in which I was overly pessimistic is that I have not, in fact, had to spend much time "de-programming students [who] read A New Kind of Science before knowing any better" — but I get a rather different class of students these days than I did in 2002. Otherwise, and for the record, I do indeed still stand behind the review. Manual trackback: Hacker News; Wolfgang; Andrew Gelman *: I removed our e-mail addresses, because no one deserves spam. Posted by crshalizi at May 03, 2012 23:10 | permanent link
May 02, 2012Installing pcalg
Attention conservation notice: Boring details about getting finicky statistical software to work; or, please read the friendly manual. Some of my students are finding it difficult to install the R package pcalg; I share these instructions in case others are also in difficulty.
When I installed pcalg on my laptop two weeks ago, it was painless, because (1) I already had graphviz, and (2) I knew about BioConductor. (In fact, the R graphical interface on the Mac will switch between installing packages from CRAN and from BioConductor.) To check these instructions, I just now deleted all the packages from my computer and re-installed them, and everything worked; elapsed time, ten minutes, mostly downloading. Posted by crshalizi at May 02, 2012 21:30 | permanent link
May 01, 2012Final Exam (Advanced Data Analysis from an Elementary Point of View)
In which we are devoted to two problems of political economy, viz., strikes, and macroeconomic forecasting. Posted by crshalizi at May 01, 2012 10:31 | permanent link
Time Series I (Advanced Data Analysis from an Elementary Point of View)What time series are. Properties: autocorrelation or serial correlation; other notions of serial dependence; strong and weak stationarity. The correlation time and the world's simplest ergodic theorem; effective sample size. The meaning of ergodicity: a single increasing long time series becomes representative of the whole process. Conditional probability estimates; Markov models; the meaning of the Markov property. Autoregressive models, especially additive autoregressions; conditional variance estimates. Bootstrapping time series. Trends and de-trending. Posted by crshalizi at May 01, 2012 10:30 | permanent link
April 30, 2012Books to Read While the Algae Grow in Your Fur, April 2012
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Enigmas of Chance; Central Asia; Philosophy Posted by crshalizi at April 30, 2012 23:59 | permanent link
April 24, 2012Brought to You by the Letters D, A, and G (Advanced Data Analysis from an Elementary Point of View)
In which the arts of estimating causal effects from observational data are practiced on Sesame Street. Posted by crshalizi at April 24, 2012 10:31 | permanent link
Estimating Causal Effects from Observations (Advanced Data Analysis from an Elementary Point of View)
Estimating graphical models: substituting consistent estimators into the formulas for front and back door identification; average effects and regression; tricks to avoid estimating marginal distributions; propensity scores and matching and propensity scores as computational short-cuts in back-door adjustment. Instrumental variables estimation: the Wald estimator, two-stage least-squares. Summary recommendations for estimating causal effects. Reading: Notes, chapter 24 Posted by crshalizi at April 24, 2012 10:30 | permanent link
April 22, 2012Separated at Birth (Advanced Data Analysis from an Elementary Point of View)
In which we use graphical causal models to understand twin studies and variance components. Posted by crshalizi at April 22, 2012 12:00 | permanent link
April 21, 2012Identifying Causal Effects from Observations (Advanced Data Analysis from an Elementary Point of View)
Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Summary recommendations for identifying causal effects. Reading: Notes, chapter 23 Posted by crshalizi at April 21, 2012 12:00 | permanent link
April 20, 2012Just How Quickly Do We Forget?
Attention conservation notice: 2500+ words on estimating how quickly time series forget their own history. Only of interest if you care about the intersection of stochastic processes and statistical learning theory. Full of jargon, equations, log-rolling and self-promotion, yet utterly abstract. I promised to say something about the content of Daniel's thesis, so let me talk about two of his papers, which go into chapter 4; there is a short conference version and a long journal version.
Recall the world's simplest ergodic theorem: if \( X_t \) is a sequence of random variables with common expectation \( m \) and variance \( v \), and stationary covariance \( \mathrm{Cov}[X_t, X_{t+h}] = c_h \). Then the time average \( \overline{X}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{X_i} \) also has expectation \( m \), and the question is whether it converges on that expectation. The world's simplest ergodic theorem asserts that if the correlation time \[ T = \frac{\sum_{h=1}^{\infty}{|c_h|}}{v} < \infty \] then \[ \mathrm{Var}\left[ \overline{X}_n \right] \leq \frac{v}{n}(1+2T) \] Since, as I said, the expectation of \( \overline{X}_n \) is \( m \) and its variance is going to zero, we say that \( \overline{X}_n \rightarrow m \) "in mean square". From this, we can get a crude but often effective deviation inequality, using Chebyshev's inequality: \[ \Pr{\left(|\overline{X}_n - m| > \epsilon\right)} \leq \frac{v}{\epsilon^2}\frac{1+2T}{n} \] The meaning of the condition that the correlation time \( T \) be finite is that the correlations themselves have to trail off as we consider events which are widely separated in time — they don't ever have to be zero, but they do need to get smaller and smaller as the separation \( h \) grows. (One can actually weaken the requirement on the covariance function to just \( \lim_{n\rightarrow \infty}{\frac{1}{n}\sum_{h=1}^{n}{c_h}} = 0 \), but this would take us too far afield.) In fact, as these formulas show, the convergence looks just like what we'd see for independent data, only with \( \frac{n}{1+2T} \) samples instead of \( n \), so we call the former the effective sample size. All of this is about the convergence of averages of \( X_t \), and based on its covariance function \( c_h \). What if we care not about \( X \) but about \( f(X) \)? The same idea would apply, but unless \( f \) is linear, we can't easily get its covariance function from \( c_h \). The mathematicians' solution to this has been to invent stronger notions of decay-of-correlations, called "mixing". Very roughly speaking, we say that \( X \) is mixing when, if you pick any two (nice) functions \( f \) and \( g \), I can always show that \[ \lim_{h\rightarrow\infty}{\mathrm{Cov}\left[ f(X_t), g(X_{t+h}) \right]} = 0 \] Note (or believe) that this is "convergence in distribution"; it happens if, and only if, the distribution of events up to time \( t \) is becoming independent of the distribution of events from time \( t+h \) onwards. To get useful results, it is necessary to quantify mixing, which is usually done through somewhat stronger notions of dependence. (Unfortunately, none of these have meaningful names. The review by Bradley ought to be the standard reference.) For instance, the "total variation" or \( L_1 \) distance between probability measures \( P \) and \( Q \), with densities \( p \) and \( q \) is, \[ d_{TV}(P,Q) = \frac{1}{2}\int{|p(u) - q(u)| du} \] This has several interpretations, but the easiest to grasp is that it says how much \( P \) and \( Q \) can differ in the probability they give to any one event: for any \( E \), \( d_{TV}(P,Q) \geq |P(E) - Q(E)| \). One use of this distance is to measure how the dependence between random variables, by seeing far their joint distribution is from the product of their marginal distributions. Abusing notation a little to write \( P(U,V) \) for the joint distribution of \( U \) and \( V \), we measure dependence as \[ \beta(U,V) \equiv d_{TV}(P(U,V), P(U) \otimes P(V)) = \frac{1}{2}\int{|p(u,v)-p(u)p(v)|du dv} \] This will be zero just when \( U \) and \( V \) are statistically independent, and one when, on average, conditioning on \( U \) confines \( V \) to a set which would otherwise have probability zero. (For instance if \( U \) has a continuous distribution and \( V \) is a function of \( U \) — or one of two randomly chosen functions of \( U \).) We can relate this back to the earlier idea of correlations between functions by realizing that \[ \beta(U,V) = \sup_{|r|\leq 1}{\left|\int{r(u,v) dP(U,V)} - \int{r(u,v)dP(U)dP(V)}\right|} ~, \] that \( \beta \) says how much the expected value of a bounded function \( r \) could change between the dependent and the independent distributions. (There is no assumption that the test function \( r \) factorizes, and in fact it's important to allow \( r(u,v) \neq f(u)g(v) \).) We apply these ideas to time series by looking at the dependence between the past and the future: \[ \begin{eqnarray*} \beta(h) & \equiv & d_{TV}(P(X^t_{-\infty}, X_{t+h}^{\infty}), P(X^t_{-\infty}) \otimes P(X_{t+h}^{\infty})) \\ & = & \frac{1}{2}\int{|p(x^t_{-\infty},x_{t+h}^{\infty})-p(x^t_{-\infty})p(x^{\infty}_{t+h})|dx^t_{-\infty}dx^{\infty}_{t+h}} \end{eqnarray*} \] (By stationarity, the integral actually does not depend on \( t \).) When \( \beta(h) \rightarrow 0 \) as \( h \rightarrow \infty \), we have a "beta-mixing" process. (These are also called "absolutely regular".) Convergence in total variation implies convergence in distribution, but not vice versa, so beta-mixing is stronger than common-or-garden mixing. Notions like beta-mixing were originally introduced purely for probabilistic convenience, to handle questions like "when does the central limit theorem hold for stochastic processes?" These are interesting for people who like stochastic processes, or indeed for those who want to do Markov chain Monte Carlo and want to know how long to let the chain run. For our purposes, though, what's important is that when people in statistical learning theory have given serious attention to dependent data, they have usually relied on a beta-mixing assumption. The reason for this focus on beta-mixing is that it "plays nicely" with approximating dependent processes by independent ones. The usual form of such arguments is as follows. We want to prove a result about our dependent but mixing process \( X \). For instance, we realize that our favorite prediction model will tend to do worse out-of-sample than on the data used to fit it, and we might want to bound the probability that this over-fitting will exceed \( \epsilon \). If we know the beta-mixing coefficients \( \beta(h) \), we can pick a separation, call it \( a \), where \( \beta(a) \) is reasonably small. Now we divide \( X \) up into \( \mu = n/a \) blocks of length \( a \). If we take every other block, they're nearly independent of each other (because \( \beta(a) \) is small) but not quite (because \( \beta(a) \neq 0 \)). Introduce a (fictitious) random sequence \( Y \), where blocks of length \( a \) have the same distribution as the blocks in \( X \), but there's no dependence between blocks. Since \( Y \) is an IID process, it is easy for us to prove that, for instance, the probability of over-fitting \( Y \) by more than \( \epsilon \) is at most some small \( \delta(\epsilon,\mu/2) \). Since \( \beta \) tells us about how well dependent probabilities are approximated by independent ones, the probability of the bad event happening with the dependent data is at most \( \delta(\epsilon,\mu/2) + (\mu/2)\beta(a) \). We can make this as small as we like by letting \( \mu \) and \( a \) both grow as the time series gets longer. Basically, anything result which holds for an IID process will also hold for a beta-mixing one, with a penalty in the probability that depends on \( \beta \). There are some details to fill in here (how to pick the separation \( a \)? should the blocks always be the same length as the "filler" between blocks?), but this is the basic frame. What it leaves open, however, is how to estimate the mixing coefficients \( \beta(h) \). For Markov models, one could it principle calculate it from the transition probabilities. For more general processes, though, calculating beta from the known distribution is not easy. In fact, we are not aware of any previous work on estimating the \( \beta(h) \) coefficients from observational data. (References welcome!) Because of this, even in learning theory, people have just assumed that the mixing coefficients were known, or that it was known they went to zero at a certain rate. This was not enough for what we wanted to do, which was actually calculate bounds on error from data. There were two tricks to actually coming up with an estimator. The first was to reduce the ambitions a little bit. If you look at the equation for \( \beta(h) \) above, you'll see that it involves integrating over the infinite-dimensional distribution. This is daunting, so instead of looking at the whole past and future, we'll introduce a horizon, \( d \) steps away, and cut things off there: \[ \begin{eqnarray*} \beta^{(d)}(h) & \equiv & d_{TV}(P(X^t_{t-d}, X_{t+h}^{t+h+d}), P(X^t_{t-d}) \otimes P(X_{t+h}^{t+h+d})) \\ & = & \frac{1}{2}\int{|p(x^t_{t-d},x_{t+h}^{t+h+d})-p(x^t_{t-d})p(x^{t+h+d}_{t+h})|dx^t_{t-d}dx^{t+h+d}_{t+h}} \end{eqnarray*} \] If \( X \) is a Markov process, then there's no difference between \( \beta^{(d)}(h) \) and \( \beta(h) \). If \( X \) is a Markov process of order \( p \), then \( \beta^{(d)}(h) = \beta(h) \) once \( d \geq p \). If \( X \) is not Markov at any order, it is still the case that \( \beta^{(d)}(h) \rightarrow \beta(h) \) as \( d \) grows. So we have an approximation to \( \beta \) which only involves finite-dimensional integrals, which we might have some hope of doing. The other trick is to get rid of those integrals. Another way of writing the beta-dependence between the random variables \( U \) and \( V \) is \[ \beta(U,V) = \sup_{\mathcal{A},\mathcal{B}}{\frac{1}{2}\sum_{a\in\mathcal{A}}{\sum_{b\in\mathcal{B}}{\left| \Pr{(a \cap b)} - \Pr{(a)}\Pr{(b)} \right|}}} \] where \( \mathcal{A} \) runs over finite partitions of values of \( U \), and \( \mathcal{B} \) likewise runs over finite partitions of values of \( V \). I won't try to show that this formula is equivalent to the earlier definition, but I will contend that if you think about how that integral gets cashed out as a sum, you can sort of see how it would be. If we want \( \beta^{(d)}(h) \), we can take \( U = X^{t}_{t-d} \) and \( V = X^{t+h+d}_{t+h} \), and we could find the dependence by taking the supremum over partitions of those two variables. Now, suppose that the joint density \( p(x^t_{t-d},x_{t+h}^{t+h+d}) \) was piecewise constant, with those pieces being rectangles parallel to the coordinate axes. Then sub-dividing those rectangles would not change the sum, and the \( \sup \) would actually be attained for that particular partition. Most densities are not of course piecewise constant, but we can approximate them by such piecewise-constant functions, and make the approximation arbitrarily close (in total variation). More, we can estimate those piecewise-constant approximating densities from a time series. Those estimates are, simply, histograms, which are about the oldest form of density estimation. We show that histogram density estimates converge in total variation on the true densities, when the bin-width is allowed to shrink as we get more data. Because the total variation distance is in fact a metric, we can use the triangle inequality to get an upper bound on the true beta coefficient, in terms of the beta coefficients of the estimated histograms, and the expected error of the histogram estimates. All of the error terms shrink to zero as the time series gets longer, so we end up with consistent estimates of \( \beta^{(d)}(h) \). That's enough if we have a Markov process, but in general we don't. So we can let \( d \) grow as \( n \) does, and that (after a surprisingly long measure-theoretic argument) turns out to do the job: our histogram estimates of \( \beta^{(d)}(h) \), with suitably-growing \( d \), converge on the true \( \beta(h) \). To confirm that this works, the papers go through some simulation examples, where it's possible to cross-check our estimates. We can of course also do this for empirical time series. For instance, in his this Daniel took four standard macroeconomic time series for the US (GDP, consumption, investment, and hours worked, all de-trended in the usual way). This data goes back to 1948, and is measured four times a year, so there are 255 quarterly observations. Daniel estimated a \( \beta \) of 0.26 at one quarter's separation, \( \widehat{\beta}(2) = 0.15 \), \( \widehat{\beta}(3) = 0.02 \), and somewhere between 0 and 0.11 for \(\widehat{\beta}(4) \). (That last is a sign that we don't have enough data to go beyond \( h = 4 \).) Optimistically assuming no dependence beyond a year, one can calculate the effective number of independent data points, which is not 255 but 31. This has morals for macroeconomics which are worth dwelling on, but that will have to wait for another time. (Spoiler: \( \sqrt{\frac{1}{31}} \approx 0.18 \), and that's if you're lucky.) It's inelegant to have to construct histograms when all we want is a single number, so it wouldn't surprise us if there were a slicker way of doing this. (For estimating mutual information, which is in many ways analogous, estimating the joint distribution as an intermediate step is neither necessary nor desirable.) But for now, we can do it, when we couldn't before. Posted by crshalizi at April 20, 2012 14:57 | permanent link
April 15, 2012Graphical Causal Models (Advanced Data Analysis from an Elementary Point of View)
Probabilistic prediction is about passively selecting a sub-ensemble, leaving all the mechanisms in place, and seeing what turns up after applying that filter. Causal prediction is about actively producing a new ensemble, and seeing what would happen if something were to change ("counterfactuals"). Graphical causal models are a way of reasoning about causal prediction; their algebraic counterparts are structural equation models (generally nonlinear and non-Gaussian). The causal Markov property. Faithfulness. Performing causal prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules for linear models. Reading: Notes, chapter 22 Posted by crshalizi at April 15, 2012 20:03 | permanent link
Exam: Is This Test Really Necessary? (Advanced Data Analysis from an Elementary Point of View)
In which the analysis of multivariate data is recursively applied. Reading: Notes, assignment Posted by crshalizi at April 15, 2012 20:02 | permanent link
Graphical Models (Advanced Data Analysis from an Elementary Point of View)
Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth? Reading: Notes, chapter 21 Posted by crshalizi at April 15, 2012 20:01 | permanent link
Mixture Models (Advanced Data Analysis from an Elementary Point of View)
From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry: planes again. Probabilistic clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions. Extended example: Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components. Reading: Notes, chapter 20; mixture-examples.R Posted by crshalizi at April 15, 2012 20:00 | permanent link
|