February 19, 2012

"From Data to Knowledge: Machine-Learning with Real-time & Streaming Applications" (Dept. of Signal Amplification)

Attention conservation notice: Intellectuals gathering in Berkeley to argue about "knowledge" and "revolution".

This looks like fun, and if I didn't have conflicting obligations I'd definitely be there.

From Data to Knowledge: Machine-Learning with Real-time & Streaming Applications

May 7-11 2012
On the Campus of the University of California, Berkeley

We are experiencing a revolution in the capacity to quickly collect and transport large amounts of data. Not only has this revolution changed the means by which we store and access this data, but has also caused a fundamental transformation in the methods and algorithms that we use to extract knowledge from data. In scientific fields as diverse as climatology, medical science, astrophysics, particle physics, computer vision, and computational finance, massive streaming data sets have sparked innovation in methodologies for knowledge discovery in data streams. Cutting-edge methodology for streaming data has come from a number of diverse directions, from on-line learning, randomized linear algebra and approximate methods, to distributed optimization methodology for cloud computing, to multi-class classification problems in the presence of noisy and spurious data.

This conference will bring together researchers from applied mathematics and several diverse scientific fields to discuss the current state of the art and open research questions in streaming data and real-time machine learning. The conference will be domain driven, with talks focusing on well-defined areas of application and describing the techniques and algorithms necessary to address the current and future challenges in the field.

Sessions will be accessible to a broad audience and will have a single track format with additional rooms for breakout sessions and posters. There will be no formal conference proceedings, but conference applicants are encouraged to submit an abstract and present a talk and/or poster.

See the conference page for submission details, schedules, etc.

Via conference organizer and CMU alumnus Joey Richards.

Enigmas of Chance; Signal Amplification

Posted by crshalizi at February 19, 2012 12:44 | permanent link

Talks Next Week

Attention conservation notice: Only of interest if you (1) like hearing people talk about statistics and machine learning, and (2) will be in Pittsburgh next week.

I have been remiss about advertising upcoming talks.

Mark Davenport, "To Adapt or Not To Adapt: The Power and Limits of Adaptivity for Sparse Estimation"
Abstract: In recent years, the fields of signal processing, statistical inference, and machine learning have come under mounting pressure to accommodate massive amounts of increasingly high-dimensional data. Despite extraordinary advances in computational power, the data produced in application areas such as imaging, remote surveillance, meteorology, genomics, and large scale network analysis continues to pose a number of challenges. Fortunately, in many cases these high-dimensional signals contain relatively little information compared to their ambient dimensionality. For example, signals can often be well-approximated as sparse in a known basis, as a matrix having low rank, or using a low-dimensional manifold or parametric model. Exploiting this structure is critical to any effort to extract information from such data.
In this talk I will overview some of my recent research on how to exploit such models to recover high-dimensional signals from as few observations as possible. Specifically, I will primarily focus on the problem of estimating a sparse vector from a small number of noisy measurements. To begin, I will consider the case where the measurements are acquired in a nonadaptive fashion. I will establish a lower bound on the minimax mean-squared error of the recovered vector which very nearly matches the performance of $\ell_1$-minimization techniques, and hence shows that these techniques are essentially optimal. I will then consider the case where the measurements are acquired sequentially in an adaptive manner. I will prove a lower bound that shows that, surprisingly, adaptivity does not allow for substantial improvement over standard nonadaptive techniques in terms of the minimax MSE. Nonetheless, I will also show that there are important regimes where the benefits of adaptivity are clear and overwhelming.
Time and place: 4--5 pm on Monday, 20 February 2012, in Scaife Hall 125
Ambuj Tewari, "From Probabilistic to Game Theoretic Foundations for Learning and Prediction"
Abstract: The probabilistic approach to prediction problems assumes that the data is generated from an underlying stochastic process. A reasonable goal then is to minimize the expected loss, or risk. The game theoretic approach, in contrast, views prediction as a repeated game between the learner and an adversary. The learner's goal then is to do well no matter what strategy is followed by the adversary. Minimizing regret is one of the well known ways to operationalize the notion of doing well. With a long history in varied disciplines such as Computer Science, Economics, Information Theory, and Statistics, the game theoretic approach has witnessed a vigorous development. Yet the suite of standard tools available for the probabilistic setting, such as Rademacher & Gaussian averages, covering numbers, and combinatorial dimensions, was missing in the game theoretic setting. In this talk, I will show how it is indeed possible to develop analogues of these tools for the game theoretic setting. Unlike the probabilistic setting, where empirical risk minimization is a canonical algorithm, we will not be able to exhibit a corresponding canonical algorithm for the game theoretic setting. However, under the additional assumption of convexity, I will show that Mirror Descent, a classic algorithm from optimization theory, is a canonical algorithm achieving minimax regret rates.
(Talk is based on papers written jointly with Alexander Rakhlin, Nathan Srebro, and Karthik Sridharan.)
Time and place: 10--11 am on Wednesday, 22 February 2012, in Gates Hall 6115
Forrest W. Crawford, "Birth, Death, Sex, Lies: Markov Counting Processes in Genetics and Beyond"
Abstract: A general birth-death process (BDP) is a continuous-time Markov chain that counts the number of particles in a system over time. At any moment in time, a particle may give birth or die, and the rate at which these events occur depends on the number of particles in the system at that time. While widely used in population biology, genetics, and evolution, statistical inference techniques for general BDPs remain elusive. In fact, the likelihood of a discrete observation from many of these processes cannot be written in closed form. In this talk, I outline several fundamental results that allow computation of transition probabilities and maximum likelihood estimates for general BDPs. I apply these novel methods to three important applied problems. First, I describe a technique for determining the effect of antibody treatment on the growth of lymphoma cells in vitro. Second, I investigate the evolution of DNA microsatellites in humans and chimpanzees using a log-linear model for the rates of repeat duplication and deletion. Finally, I use a BDP to infer true counts of sex acts from rounded self-reported counts in a longitudinal study of risky behaviors in young people living with HIV. These applications illustrate the mathematical, statistical, and computational challenges involved in learning from BDPs in biology, medicine, and public health.
Time and place: 4--5 pm on Wednesday, 22 February 2012, in Scaife Hall 125
Ron Bekkerman, "Scaling Up Machine Learning"
Abstract: In this talk, I'll provide an extensive introduction to parallel and distributed machine learning. I'll answer the questions "How actually big is the big data?", "How much training data is enough?", "What do we do if we don't have enough training data?", "What are platform choices for parallel learning?" etc. Over an example of k-means clustering, I'll discuss pros and cons of machine learning in Apache Pig, MPI, DryadLINQ, and CUDA. Time permitting, I'll take a dive into a super large scale text categorization task.
Time and place: 1:30--2:30 pm on Thursday, 23 February 2012, in Newell-Simon Hall 1305

As always, the talks are free and open to the public.

(You see why I have trouble keeping up with these.)

Enigmas of Chance

Posted by crshalizi at February 19, 2012 12:30 | permanent link

February 15, 2012

How the North American Mammalian Paleofauna Got a Crook in Its Curve (Advanced Data Analysis from an Elementary Point of View)

In which extinct charismatic megafauna give us an excuse to practice basic programming, bootstrapping, and specification testing.

Assignment, R

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 15, 2012 14:15 | permanent link

Testing Regression Specifications (Advanced Data Analysis from an Elementary Point of View)

Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using bootstrapping from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.

Reading: Notes, chapter 10

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 15, 2012 14:10 | permanent link

Writing R Code (Advanced Data Analysis from an Elementary Point of View)

A change to the lecture schedule, by popular demand!

R programs are built around functions: pieces of code that take inputs or arguments, do calculations on them, and give back outputs or return values. The most basic use of a function is to encapsulate something we've done in the terminal, so we can repeat it, or make it more flexible. To assure ourselves that the function does what we want it to do, we subject it to sanity-checks, or "write tests". To make functions more flexible, we use control structures, so that the calculation done, and not just the result, depends on the argument. R functions can call other functions; this lets us break complex problems into simpler steps, passing partial results between functions. Programs inevitably have bugs: debugging is the cycle of figuring out what the bug is, finding where it is in your code, and fixing it. Good programming habits make debugging easier, as do some tricks. Avoiding iteration. Re-writing code to avoid mistakes and confusion, to be clearer, and to be more flexible.

Reading: Notes, chapter 9

Optional reading: Slides from 36-350, introduction to statistical computing, especially through lecture 15.

R for in-class demos (based around the previous problem set)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 15, 2012 14:05 | permanent link

Cozy Catastrophes

Attention conservation notice: Academics with blogs quibbling about obscure corners of applied statistics.

Lurkers in e-mail point me to this pushback against the general pushback against power laws, and ask me to comment. It might be a mistake to do so, but I'm feeling under the weather and so splenetic, so I will.

In our paper, we looked at 24 quantities which people claimed showed power law distributions. Of these, there were seven cases where we could flat-out reject a power law, without even having to consider an alternative, because the departures of the actual distribution from even the best-fitting power law were much too large to be explained away as fluctuations. (One of the wonderful things about a stochastic model is that it tells you how big its own errors should be.) In contrast, there was only one data set where we could rule out the log-normal distribution.

In some of those cases, you can patch things up, sort of, by replacing a pure power law with a power-law with an exponential cut-off. That is, rather than the probability density being proportional to $x^{-a}$, it's proportional to $x^{-a} e^{-x/L}$. (Either way, I am only talking about the probability density in the "right tail", i.e., for x above some $x_{\min}$.) This gives the infamous straight-ish patch on a log-log plot, for values of x much smaller than L, but otherwise it has substantially different properties. In ten of the twenty-four cases we looked at, the only way to save the idea of a power-law at all is to include this exponential cut-off. But that exponentially-shrinking factor is precisely what squelches the WTF, X IS ELEVENTY TIMES LARGER THAN EVER! THE BIG ONE IS IN OUR BASE KILLING OUR DOODZ!!!!1!! mega-events. There were ten more cases where we judged the support for power laws as "moderate", meaning "the power law is a good fit but that there are other plausible alternatives as well" (pardon the self-quotation.) Again, those alternatives, like log-normals and stretched exponentials, give very different tail-behavior, with not so much OMG DOOM.
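To see how hard the cut-off squelches the tail, here is a small Python sketch comparing the two unnormalized densities. The exponent a = 2.5 and cut-off scale L = 100 are made-up illustrative values, not fits to any of our data sets, and I ignore the normalizing constants, which differ between the two forms but do not depend on x.

```python
import math


def power_law_density(x, a):
    """Unnormalized pure power-law tail density, proportional to x^(-a)."""
    return x ** (-a)


def cutoff_power_law_density(x, a, L):
    """Unnormalized power law with exponential cut-off: x^(-a) * exp(-x/L)."""
    return x ** (-a) * math.exp(-x / L)


def cutoff_ratio(x, a, L):
    """How much the cut-off shrinks the density at x, relative to the pure power law."""
    return cutoff_power_law_density(x, a, L) / power_law_density(x, a)


a, L = 2.5, 100.0  # illustrative values only
for x in (1.0, 10.0, 100.0, 1000.0, 10000.0):
    print(f"x = {x:>8.0f}   cut-off / pure = {cutoff_ratio(x, a, L):.3e}")
```

For x much smaller than L the ratio is close to 1, which is the straight-ish patch on the log-log plot. Once x passes L the ratio is exp(-x/L), whatever the exponent, so events ten times larger than L are suppressed by a factor of exp(-10) relative to the pure power law.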

We found exactly one case where the statistical evidence for the power-law was "good", meaning that "the power law is a good fit and that none of the alternatives considered is plausible", which was Zipf's law of word frequency distributions. We were of course aware that when people claim there are power laws, they usually only mean that the tail follows a power law. This is why all these comparisons were about how well the different distributions fit the tail, excluding the body of the data. We even selected where "the tail" begins to maximize the fit to a power law for each case. Even so, there was just this one case where the data compellingly support a power law tail.

(All of this — the meaning of "with cut-off", the meaning of our categorizations, the fact that we only compare the tails, etc. — is clear enough from our paper, if you actually read the text. Or even just the tables and their captions.)

I bring up the OMG DOOM because some people, Hanson very much included, like to extrapolate from supposed power laws for various Bad Things to scenarios where THE BIG ONE kills off most of humanity. But, at least with the data we found, the magnitudes of forest fires, solar flares, earthquakes and wars were all better fit by log-normals, by stretched exponentials and by cut-off power laws than by power laws. For fires, flares and quakes, the differences are large enough that they clearly fall into the "with cut-off only" category. The differences in fits for the war-death data are smaller, as (mercifully) is the sample size, so we put it in the "moderate" support category. If you had some compelling other reason to insist on a power law rather than (e.g.) a log-normal there, the data wouldn't slap you down, but they wouldn't back you up either.

Now, I relish the schadenfreude-laden flavors of a mega-disaster scenario as much as the next misanthropic, science-fiction-loving geek, especially when it's paired with some "The fools! Can't they follow simple math?" on the side. Truly, I do. But squeezing that savory, juicy DOOM out of (for instance) the distribution of solar flares relies on the shape of the tail, i.e., whether it's a pure power law or not. The weak support, in the data, for such power laws means you don't really have empirical evidence for your scenarios, and in some cases what evidence there is tells against them. It's a free country, so you can go on telling those stories, but don't pretend that they owe more to confronting hard truths than to literary traditions.

Power Laws

Posted by crshalizi at February 15, 2012 14:00 | permanent link

February 13, 2012

Of Variance Explained; or, Chronicles of Deaths Smoothed

Attention conservation notice: 1500 word pedagogical-statistical rant, with sarcasm, mathematical symbols, computer code, and a morally dubious affectation of detachment from the human suffering behind the numbers. Plus the pictures are boring.
Does anyone know when the correlation coefficient is useful, as opposed to when it is used? If so, why not tell us?
— Tukey (1954: 721)

If you have taken any sort of statistics class at all, you have probably been exposed to the idea of the "proportion of variance explained" by a regression, conventionally written $R^2$. This has two definitions, which happen to coincide for linear models fit by least squares. The first is to take the correlation between the model's predictions and the actual values ($R$) and square it ($R^2$), getting a number which is guaranteed to be between 0 and 1. You get 1 only when the predictions are perfectly correlated with reality, and 0 when there is no linear relationship between them. The other definition is the ratio of the variance of the predictions to the variance of the actual values. It is this latter which leads to the notion that $R^2$ is the proportion of variance explained by the model.
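If you want to check that the two definitions really do coincide for least squares, a minimal Python sketch follows. The data are simulated from a made-up linear model (intercept 2, slope 0.5, unit-variance Gaussian noise), and the helper names are mine; none of this is the R code linked at the end of the post.

```python
import random
import statistics


def ols_fit(x, y):
    """Least-squares intercept and slope for a one-variable linear model."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r2_correlation(pred, y):
    """Definition 1: squared correlation between predictions and actual values."""
    mp, my = statistics.fmean(pred), statistics.fmean(y)
    cov = sum((p - mp) * (yi - my) for p, yi in zip(pred, y))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov ** 2 / (var_p * var_y)


def r2_variance_ratio(pred, y):
    """Definition 2: variance of the predictions over variance of the actual values."""
    return statistics.pvariance(pred) / statistics.pvariance(y)


rng = random.Random(42)  # arbitrary seed, for reproducibility only
x = [rng.uniform(0, 10) for _ in range(500)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]  # made-up linear model

a_hat, b_hat = ols_fit(x, y)
pred = [a_hat + b_hat * xi for xi in x]
r2_corr = r2_correlation(pred, y)
r2_ratio = r2_variance_ratio(pred, y)
print(r2_corr, r2_ratio)  # agree up to floating-point rounding
```

The agreement is a special property of least-squares linear fits. Feed the same two functions predictions from some other kind of model and the numbers will generally differ.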

The use of the word "explained" here is quite unsupported and often actively misleading. Let me go over some examples to indicate why.

Start by supposing that a linear model is true:

Y = a + bX + noise
where the noise has constant variance s, and is uncorrelated with X. Suppose that we know this is the model to use, and suppose further that, as a reward for our scrupulous peer-review of anonymous manuscripts, the Good Fairy of Statistical Modeling tells us the correct values of the parameters a and b. Surely, with the right parameters in the right model, our $R^2$ must be very high?

Well, no. The answer depends on the variance of X, which it will be convenient to call v. The variance of the predictions is $b^2 v$, but the variance of Y is larger, $b^2 v + s$. The ratio is

$R^2 = [b^2 v]/[b^2 v + s]$
(You can check that this is also the squared correlation between the predictions and Y.) As v shrinks, this tends to 0/s = 0. As v grows, this tends to 1. The relationship between X and Y doesn't change, the accuracy and precision with which Y can be predicted from X do not change, but $R^2$ can wander all through its range, just depending on how dispersed X is.
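A quick numerical check, as a Python sketch: the parameter values (a = 1, b = 2, s = 4) and the Gaussian distributions for X and the noise are arbitrary choices of mine, made only to have something to simulate. Predictions use the true a and b, as if the Good Fairy had supplied them.

```python
import random
import statistics


def theoretical_r2(b, v, s):
    """b^2 v / (b^2 v + s): the variance ratio when predicting with the true a and b."""
    return b * b * v / (b * b * v + s)


def simulated_r2(a, b, v, s, n=20000, seed=1):
    """Variance of the true-parameter predictions over the variance of Y, by simulation."""
    rng = random.Random(seed)
    x = [rng.gauss(0, v ** 0.5) for _ in range(n)]         # X has variance v
    y = [a + b * xi + rng.gauss(0, s ** 0.5) for xi in x]  # noise has variance s
    pred = [a + b * xi for xi in x]
    return statistics.pvariance(pred) / statistics.pvariance(y)


a, b, s = 1.0, 2.0, 4.0  # illustrative values, not taken from any data
for v in (0.01, 1.0, 100.0):
    print(v, round(theoretical_r2(b, v, s), 4), round(simulated_r2(a, b, v, s), 4))
```

The model, the parameters and the noise variance are identical across the three rows; only the spread of X changes, and the ratio goes from about 0.01 to about 0.99.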

Now, you say, this is a silly algebraic curiosity. Never mind the Good Fairy of Statistical Modeling handing us the correct parameters, let's talk about something gritty and real, like death in Chicago.

Number of deaths each day in Chicago, 1 January 1987--31 December 2000, from all causes except accidents. (Click this and all later figures for larger PDF versions. See below for link to code.)

I can relate deaths to time in any number of ways; the next figure shows what I get when I use a smoothing spline (and use cross-validation to pick how much smoothing to do). The statistical model is

death = f0(date) + noise
with f0 being a function learned from the data.
As before, but with the addition of a smoothing spline.

The root-mean-square error of the smoothing spline is just above 12 deaths/day. The $R^2$ of the fit is either 0.35 (squared correlation between predicted and actual deaths) or 0.33 (variance of predicted deaths over variance of actual deaths). It seems absurd, however, to say that the date explains how many people died in Chicago on a given day, or even the variation from day to day. The closest I can come up with to an example of someone making such a claim would be an astrologer, and even one of them would work in some patter about the planets and their influences. (Numerologists, maybe? I dunno.)

Worse is to follow. The same data set which gives me these values for Chicago includes other variables, such as the concentration of various atmospheric pollutants and temperature. I can fit an additive model, which tries to tease out the separate relationships between each of those variables and deaths in Chicago, without presuming a particular functional form for each relationship. In particular I can try the model

deaths = f1(sulfur dioxide) + f2(particulates) + f3(temperature, ozone) + noise
where the functions f1, f2 and f3 are all learned from data. (Exercise: why do I do a joint smoothing against temperature and ozone?) When I do that, I get functions which look like the following.
Estimated partial response functions for concentration of sulfur dioxide, concentration of particulates, and (jointly) temperature and concentration of ozone, all taken as averages over four-day moving windows.

The $R^2$ of this model is 0.27. Is this "variance explained"? Well, it's at least not incomprehensible to talk about changes in temperature or pollution explaining changes in mortality. In fact, adding this model's predictions to the simple spline's, we see that most of what the spline predicted from the date is predictable from pollution and temperature:

Black dots: actual death counts. Red curve: spline smoothing on the date alone. Blue lines: predictions from the temperature-and-pollution model.
But notice it is not anything in the math or the statistics which tells us that this is a step closer to something we might, unblushingly, call an "explanation". The astrologer, after all, could look at this figure the other way, and say that really pollution and temperature are just crude proxies for the position of Mars (or whatever).

We could, in fact, try to include the date in this larger model:

deaths = f0(date) + f1(sulfur dioxide) + f2(particulates) + f3(temperature, ozone) + noise
Of course, we have to re-estimate all the functions, but as it turns out they don't change very much. (I'd show you the plot of the fitted values over time as well, but visually it's almost indistinguishable from the last one.)

Despite the lack of visual drama, putting a smooth function of time back into the model increases $R^2$, from 0.27 to 0.30. Formally, the date enters into the model in exactly the same way as particulate pollution. But, again, only a fortune teller (an unusually numerate fortune teller, perhaps a subscriber to the Journal of Evidence-Based Haruspicy) would say that the date explains, or helps explain, 3% of the variance.

I hope that by this point you will at least hesitate to think or talk about $R^2$ as "the proportion of variance explained". (I will not insist on your never talking that way, because you might need to speak to the deluded in terms they understand.) How then should you think about it? I would suggest: the proportion of variance retained, or just kept, by the predictions. Linear regression is a smoothing method. (It just smoothes everything on to a line, or more generally a hyperplane.) It's hard for any smoother to give fitted values which have more variance than the variable it is smoothing. $R^2$ is merely the fraction of the target's variance which is not smoothed away.

This of course raises the question of why you'd care about this number at all. If prediction is your goal, then it would seem much more natural to look at mean squared error. (Or really root mean squared error, so it's in the same units as the variable predicted.) Or mean absolute error. Or median absolute error. Or a genuine loss function. If on the other hand you want to get some function right, then your question is really about mis-specification, and/or confidence sets of functions, and not about whether your smoother is following every last wiggle of the data at all. If you want an explanation, the fact that there is a peak in deaths every year of about the same height, but the predictions fall short of it, suggests that this model is missing something. The fact that the data shows something awful happened in 1995 and the model has nothing adequate to say about it suggests that whatever's missing is very important.

Code for reproducing the figures and analyses in R. (I make this public, despite the similarity of this exercise to the last problem-set in advanced data analysis, because (i) it's not exactly the same, (ii) the homework is due in ten hours, (iii) none of my students would dream of copying this and turning it in as their own, and (iv) I borrowed the example from Simon Wood's Generalized Additive Models.)

Enigmas of Chance

Posted by crshalizi at February 13, 2012 23:54 | permanent link

Power Law News

1. I'd like to say that you have no idea how long I have waited to read something like this piece by Michael Stumpf and Mason Porter in one of the glossy journals. But that would be a lie, because if you've been reading this for any length of time, you know that the answer is, long enough to be very tiresome about it. If the referees, and still more the editors, at those journals can be persuaded to pay attention, we will be on track for my mid-2007 hope that "in five to ten years even science journalists and editors of Wired will begin to get the message." (I never really had any hopes for Wired.)

2. You can imagine how my heart sank to see that Krugman had a post titled "The Power (Law) of Twitter" — and my relief to see that he's not actually saying that the distribution of followers is a power law. It is however interesting that the distribution is so close to a log-normal.

3. My ex-boss and mentor Melanie Mitchell has a blog, and promises a substantive series of posts on power laws and scaling. In the meanwhile, go read her book.

Update, 15 February: see later post.

Manual trackback: Brendan O'Connor

(Nos. 1 and 2 via too many to list.)

Power Laws

Posted by crshalizi at February 13, 2012 20:40 | permanent link

February 09, 2012

Additive Models (Advanced Data Analysis from an Elementary Point of View)

The "curse of dimensionality" limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimensionality. Parametric models do not have this problem, but have bias and do not let us discover anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example, where each input variable has a "partial response function", which add together to get the total regression function; the partial response functions are unconstrained. This generalizes linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. Examples in R using the California house-price data. Conclusion: there are no statistical reasons to prefer linear models to additive models, hardly any scientific reasons, and increasingly few computational ones; the continued thoughtless use of linear regression is a scandal.
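The iterative fitting procedure (backfitting) can be sketched in a few lines. This is Python rather than the R used in lecture, the running-mean smoother is a crude stand-in for a spline or kernel smoother, and the two-variable data are simulated from a made-up additive function, so take it as an illustration of the idea and not as the course code.

```python
import math
import random
import statistics


def smooth(x, r, half_window=10):
    """One-dimensional smoother: running mean of r over neighbours in x-order."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    fitted = [0.0] * len(x)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - half_window), min(len(x), rank + half_window + 1)
        fitted[i] = statistics.fmean(r[j] for j in order[lo:hi])
    return fitted


def backfit(columns, y, n_sweeps=20):
    """Fit y ~ alpha + sum_j f_j(x_j) by cycling one-dimensional smooths."""
    n = len(y)
    alpha = statistics.fmean(y)
    partials = [[0.0] * n for _ in columns]  # initial guess: every f_j is zero
    for _ in range(n_sweeps):
        for j, xj in enumerate(columns):
            # Partial residuals: what is left after removing all the other terms.
            resid = [y[i] - alpha - sum(p[i] for k, p in enumerate(partials) if k != j)
                     for i in range(n)]
            fj = smooth(xj, resid)
            centre = statistics.fmean(fj)
            partials[j] = [f - centre for f in fj]  # keep each f_j centred at zero
    fitted = [alpha + sum(p[i] for p in partials) for i in range(n)]
    return alpha, partials, fitted


rng = random.Random(3)  # arbitrary seed, for reproducibility only
n = 400
x1 = [rng.uniform(-2, 2) for _ in range(n)]
x2 = [rng.uniform(-2, 2) for _ in range(n)]
# Made-up truth: additive, but not linear in either variable.
y = [math.sin(2 * u) + w * w + rng.gauss(0, 0.3) for u, w in zip(x1, x2)]

alpha, partials, fitted = backfit([x1, x2], y)
mse_additive = statistics.fmean((f - yi) ** 2 for f, yi in zip(fitted, y))
mse_constant = statistics.pvariance(y)  # MSE of predicting the mean everywhere
print(mse_additive, mse_constant)
```

Each pass smooths the partial residuals for one variable against that variable alone, so only one-dimensional smoothing is ever done. Centring each partial response function keeps the intercept identifiable.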

Reading: Notes, chapter 8; Faraway, chapter 12

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 09, 2012 10:30 | permanent link

February 07, 2012

It's Not the Heat that Gets to You, It's the Sustained Conjunction of Heat with Elevated Levels of Atmospheric Pollutants (Advanced Data Analysis from an Elementary Point of View)

In which spline regression becomes a matter of life and death in Chicago.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 07, 2012 10:31 | permanent link

Splines (Advanced Data Analysis from an Elementary Point of View)

Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression.

Reading: Notes, chapter 7; Faraway, section 11.2.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 07, 2012 10:30 | permanent link

February 02, 2012

Heteroskedasticity, Weighted Least Squares, and Variance Estimation (Advanced Data Analysis from an Elementary Point of View)

Weighted least squares estimates. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
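A minimal sketch of the known-variance-function case, in Python rather than R: the weights are the reciprocals of the noise variances. The data are simulated with a noise standard deviation proportional to x, which is an assumption made purely for illustration, as are the true intercept and slope.

```python
import random


def wls_fit(x, y, w):
    """Weighted least squares for y = a + b*x: minimise sum_i w_i (y_i - a - b x_i)^2."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(11)  # arbitrary seed, for reproducibility only
x = [rng.uniform(0.5, 5) for _ in range(300)]
noise_sd = list(x)  # assumed known: noise sd grows in proportion to x
y = [3.0 - 2.0 * xi + rng.gauss(0, sd) for xi, sd in zip(x, noise_sd)]

ols = wls_fit(x, y, [1.0] * len(x))                       # equal weights reproduce OLS
wls = wls_fit(x, y, [1.0 / sd ** 2 for sd in noise_sd])   # weights = 1 / variance
print("OLS (intercept, slope):", ols)
print("WLS (intercept, slope):", wls)
```

When the variance function is not known, the same function can be reused inside a loop: fit, smooth the squared residuals against x to estimate the variance function, re-weight, and repeat.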

Reading: Notes, chapter 6; Faraway, section 11.3.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 02, 2012 10:30 | permanent link

January 31, 2012

Books to Read While the Algae Grow in Your Fur, January 2012

Attention conservation notice: I have no taste.

Stephen Greenblatt, The Swerve: How the World Became Modern
A rather rambling and formless, if amiable and enthusiastic, popular history of Lucretius's De Rerum Natura and its rediscovery during the Renaissance. The grandiosity of the subtitle is not, thankfully, insisted upon in the text, which in fact says rather little about the quite interesting history of how Lucretius was taken up, and Epicurean ideas were elaborated on, in early modern Europe. Passages of novelistic you-are-there detail, which Greenblatt admits are totally made up, are mercifully brief and fairly clearly marked as such. (Such claims of influence as he does make strike me as very thinly supported, though not clearly wrong.) Enjoyable, if slight, if you are prepared to care very deeply about books, and to sympathize with philosophical materialism.
(I am not sure why Greenblatt writes that the only manuscripts we have from the ancient world are those from Herculaneum preserved by the eruption of Mt. Vesuvius. In Egypt and other desert countries, manuscripts have survived from Roman, Ptolemaic and even earlier times, some of them rather famous. But he is not a classicist, and one hopes he is a bit more careful about his own period.)
Margaret C. Jacob, Strangers Nowhere in the World: The Rise of Cosmopolitanism in Early Modern Europe
On the positive side, the subject is important, and there were lots of interesting anecdotes and suggestions. Against that, it is far too scatter-shot and lacks not only a single global argument, but even much cohesion within individual chapters. It is also far too limited in scope, to the Enlightenment and its immediate predecessors in the 17th century. But if one wanted to look even at what was distinctive about that sort of cosmopolitanism, it's very strange to not even try to compare it to Latinate humanism and earlier medieval traditions, or the way the travels of learned artists spread styles and ideas during the Renaissance and before. (Comparison with any other part of the world is of course too much to expect of a Europeanist, even one interested in cosmopolitanism.) Finally, Jacob makes causal claims — e.g., that alchemical ideas in early-modern natural philosophy were displaced by mechanical ones because the latter were less politically troubling to monarchies — with a sweep and assurance totally out of proportion to anything she presents by way of evidence or argument. Over-all of little value to me, but perhaps of more use to specialists in the period.
Amar Bhidé, A Call for Judgment: Sensible Finance for a Dynamic Economy
Full-length review: Hayek contra Chicago.
Rachel Loden, Dick of the Dead
Not as good as her superb Hotel Imperium, but still great:
The Idiad

Shall I write a poem about you
And your epic struggle against stupidity?
Feh. But if the brain is a city
I too have rooms in the swampy part, surrounded by crocodiles.
The monarch butterflies sail down from the Canadian Rockies
To overwinter in Pacific Grove, pair off and fly away;
They bruise me. I get crankier.
If you are coming down through the narrows of the Saugatuck
Please text me beforehand,
And I will come out to meet you
As far as Palookaville.

Gerda Claeskens and Nils Lid Hjort, Model Selection and Model Averaging
Full-length review: How Can You Choose Just One?.
Shorter me: the best available review of model selection from a statistical standpoint. Presumes a reader with some knowledge of asymptotic statistics.
Shirley Jackson, The Haunting of Hill House
Exactly as good, as monstrous, and as ambiguous, as I remember it (unlike The Sundial). One mark of its excellence is that its things that go bump in the night are perfectly convincing, and yet the real horrors are all those of the all-too-human mind. I am not sure what point there is to other haunted house stories, really.
ObLinkage: Kit Whitfield on the first paragraph of the novel. Whitfield is exactly right about the way "small, unnerving echoes whisper back and forth along her pages". (Take, please take, the ending, for example.)
Patrick O'Brian, The Letter of Marque; The Thirteen Gun Salute; The Nutmeg of Consolation; Clarissa Oakes / The Truelove
Books to Read While the Algae Grow in Your Fur; Writing for Antiquity; The Great Transformation; The Commonwealth of Letters; Scientifiction and Fantastica; Enigmas of Chance; The Dismal Science

Posted by crshalizi at January 31, 2012 23:59 | permanent link

How the Hyracotherium Got Its Mass (Advanced Data Analysis from an Elementary Point of View)

In which we consider evolutionary trends in body size, aided by regression modeling and the bootstrap.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 31, 2012 19:11 | permanent link

The Bootstrap (Advanced Data Analysis from an Elementary Point of View)

Quantifying uncertainty by looking at sampling distributions. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping. Non-parametric bootstrapping. Many examples. When does the bootstrap fail?
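
As a rough illustration of the non-parametric bootstrap described above, here is a minimal sketch in Python rather than the course's R. The data (200 draws from an exponential distribution), the choice of the median as the statistic, and the helper names `bootstrap_replicates` and `percentile_interval` are all made up for this example and are not taken from the notes.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, n_boot=2000, seed=0):
    """Re-compute `statistic` on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]

def percentile_interval(replicates, level=0.95):
    """Crude percentile confidence interval from bootstrap replicates."""
    ordered = sorted(replicates)
    alpha = (1.0 - level) / 2.0
    lo = ordered[int(alpha * len(ordered))]
    hi = ordered[int((1.0 - alpha) * len(ordered)) - 1]
    return lo, hi

# Hypothetical data: 200 draws from an exponential distribution with mean 1.
rng = random.Random(42)
sample = [rng.expovariate(1.0) for _ in range(200)]

# The spread of the replicates stands in for the sampling distribution.
reps = bootstrap_replicates(sample, statistics.median)
se = statistics.stdev(reps)
ci_lo, ci_hi = percentile_interval(reps)
print(f"median = {statistics.median(sample):.3f}, bootstrap SE = {se:.3f}, "
      f"95% interval = ({ci_lo:.3f}, {ci_hi:.3f})")
```

The percentile interval is only the simplest of several bootstrap intervals, and resampling individual points presumes independent observations.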

Reading: Notes, chapter 5 (R for figures and examples; pareto.R; wealth.dat); R for in-class examples

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 31, 2012 19:10 | permanent link

You think you want big data? You can't handle big data! (Next Week at the Statistics Seminar)

Fortunately, however, the methods of those who can handle big data are neither grotesque nor incomprehensible, and we will hear about them on Monday.

Alekh Agarwal, "Computation Meets Statistics: Trade-offs and Fundamental Limits for Large Data Sets"
Abstract: The past decade has seen the emergence of datasets of unprecedented scale, with both large sample sizes and dimensionality. Massive data sets arise in various domains, among them computer vision, natural language processing, computational biology, social networks analysis and recommendation systems, to name a few. In many such problems, the bottleneck is not just the number of data samples, but also the computational resources available to process the data. Thus, a fundamental goal in these problems is to characterize how estimation error behaves as a function of the sample size, number of parameters, and the computational budget available.
In this talk, I present three research threads that provide complementary lines of attack on this broader research agenda: (i) lower bounds for statistical estimation with computational constraints; (ii) interplay between statistical and computational complexities in structured high-dimensional estimation; and (iii) a computational budgeted framework for model selection. The first characterizes fundamental limits in a uniform sense over all methods, whereas the latter two provide explicit algorithms that exploit the interaction of computational and statistical considerations.
Joint work with John Duchi, Sahand Negahban, Clement Levrard, Pradeep Ravikumar, Peter Bartlett, and Martin Wainwright.
Time and place: 4--5 pm on Monday, 6 February 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Posted by crshalizi at January 31, 2012 19:00 | permanent link

"The Cut and Paste Process" (This Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about combinatorial stochastic processes and their statistical applications, and (2) will be in Pittsburgh on Wednesday afternoon.

It is only in very special weeks, when we have been very good, that we get two seminars.

Harry Crane, "The Cut-and-Paste Process"
Abstract: In this talk, we present the cut-and-paste process, a novel infinitely exchangeable process on the state space of partitions of the natural numbers whose sample paths differ from previously studied exchangeable coalescent (Kingman 1982; Pitman 1999) and fragmentation (Bertoin 2001) processes. Though it evolves differently, the cut-and-paste process possesses some of the same properties as its predecessors, including a unique equilibrium measure, associated measure-valued process, a Poisson point process construction and transition probabilities which can be described in terms of Kingman's paintbox process. A parametric subfamily is related to the Chinese restaurant process and we illustrate potential applications of this model to phylogenetic inference based on RNA/DNA sequence data. There are some natural extensions of this model to Bayesian inference, hidden Markov models and tree-valued Markov processes which we will discuss.
We also discuss how this process and its extensions fit into the more general framework of statistical modeling of structure and dependence via combinatorial stochastic processes, e.g. random partitions, trees and networks, and the practical importance of infinite exchangeability in this context.
Time and place: 4--5 pm on Wednesday, 1 February 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at January 31, 2012 18:45 | permanent link

January 28, 2012

Scientific Community to Elsevier: Drop Dead

Attention conservation notice: Associate editor at a non-profit scientific journal endorses a call for boycotting a for-profit scientific journal publisher.

I have for years been refusing to publish in or referee for journals published by Elsevier; pretty much all of the commercial journal publishers are bad deals1, but Elsevier is outrageously worse than most. Since learning that Elsevier had a business line in putting out publications designed to look like peer-reviewed journals, and calling themselves journals, but actually full of paid-for BS, I have had a form letter I use for declining requests to referee, letting editors know about this, and inviting them to switch to a publisher which doesn't deliberately seek to profit by corrupting the process of scientific communication.

I am thus extremely happy to learn from Michael Nielsen that Tim Gowers is organizing a general boycott of Elsevier, asking people to pledge not to contribute to its journals, referee for them, or do editorial work for them. You can sign up here, and I strongly encourage you to do so. There are fields where Elsevier does publish the leading journals, and where this sort of boycott would be rather more personally costly than it is in statistics, but there is precedent for fixing that. Once again, I strongly encourage readers in academia to join this.

(To head off the inevitable mis-understandings, I am not, today, calling for getting rid of journals as we know them. I am saying that Elsevier is ripping us off outrageously, that conventional journals can be published without ripping us off, and so we should not help Elsevier to rip us off.)

Disclaimer, added 29 January: As I should have thought went without saying, I am speaking purely for myself here, and not with any kind of institutional voice. In particular, I am not speaking for the Annals of Applied Statistics, or for the IMS, which publishes it. (Though if the IMS asked its members to join in boycotting Elsevier, I would be very happy.)

1: Let's review how scientific journals work, shall we? Scientists are not paid by journals to write papers: we do that as volunteer work, or more exactly, part of the money we get for teaching and from research grants is supposed to pay for us to write papers. (We all have day-jobs.) Journals are edited by scientists, who volunteer for this and get nothing from the publisher. (New editors get recruited by old editors.) Editors ask other scientists to referee the submissions; the referees are volunteers, and get nothing from the publisher (or editor). Accepted papers are typeset by the authors, who usually have to provide "camera-ready" copy. The journal publisher typically provides an electronic system for keeping track of submitted manuscripts and the refereeing process. Some of them also provide a minimal amount of copy-editing on accepted papers, of dubious value. Finally, the publisher actually prints the journal, and runs the server distributing the electronic version of the paper, which is how, in this day and age, most scientists read it. While the publisher's contribution isn't nothing, it's also completely out of proportion to the fees they charge, let alone economically efficient pricing. The whole thing would grind to a halt without the work done by scientists, as authors, editors and referees. That work, to repeat, is paid for either by our students or by our grants, not by the publisher. This makes the whole system of for-profit journal publication economically insane, a check on the dissemination of knowledge which does nothing to encourage its creation. Elsevier is simply one of the worst of these parasites.

Manual trackback: Cosmic Variance; Open A Vein; AgroEcoPeople; QED Insight

Learned Folly

Posted by crshalizi at January 28, 2012 11:15 | permanent link

January 27, 2012

Changing How Changes Change (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about covariance matrices and (2) will be in Pittsburgh on Monday.

Since so much of multivariate statistics depends on patterns of correlation among variables, it is a bit awkward to have to admit that in lots of practical contexts, correlation matrices are just not very stable, and can change quite drastically. (Some people pay a lot to rediscover this.) It turns out that there are more constructive responses to this situation than throwing up one's hands and saying "that sucks", and on Monday a friend of the department and general brilliant-type-person will be kind enough to tell us about them:

Emily Fox, "Bayesian Covariance Regression and Autoregression"
Abstract: Many inferential tasks, such as analyzing the functional connectivity of the brain via coactivation patterns or capturing the changing correlations amongst a set of assets for portfolio optimization, rely on modeling a covariance matrix whose elements evolve as a function of time. A number of multivariate heteroscedastic time series models have been proposed within the econometrics literature, but are typically limited by lack of clear margins, computational intractability, and curse of dimensionality. In this talk, we first introduce and explore a new class of time series models for covariance matrices based on a constructive definition exploiting inverse Wishart distribution theory. The construction yields a stationary, first-order autoregressive (AR) process on the cone of positive semi-definite matrices.
We then turn our focus to more general predictor spaces and scaling to high-dimensional datasets. Here, the predictor space could represent not only time, but also space or other factors. Our proposed Bayesian nonparametric covariance regression framework harnesses a latent factor model representation. In particular, the predictor-dependent factor loadings are characterized as a sparse combination of a collection of unknown dictionary functions (e.g., Gaussian process random functions). The induced predictor-dependent covariance is then a regularized quadratic function of these dictionary elements. Our proposed framework leads to a highly-flexible, but computationally tractable formulation with simple conjugate posterior updates that can readily handle missing data. Theoretical properties are discussed and the methods are illustrated through an application to the Google Flu Trends data and the task of word classification based on single-trial MEG data.
Time and place: 4--5 pm on Monday, 30 January 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at January 27, 2012 14:25 | permanent link

January 26, 2012

Smoothing Methods in Regression (Advanced Data Analysis from an Elementary Point of View)

The constructive alternative to complaining about linear regression is non-parametric regression. There are many ways to do this, but we will focus on the conceptually simplest one, which is smoothing; especially kernel smoothing. All smoothers involve local averaging of the training data. The bias-variance trade-off tells us that there is an optimal amount of smoothing, which depends both on how rough the true regression curve is, and on how much data we have; we should smooth less as we get more information about the true curve. Knowing the truly optimal amount of smoothing is impossible, but we can use cross-validation to select a good degree of smoothing, and adapt to the unknown roughness of the true curve. Detailed examples. Analysis of how quickly kernel regression converges on the truth. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.
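
To make the idea of kernel smoothing with a cross-validated bandwidth concrete, here is a small sketch in Python (not the R from the notes, and not the np package). It uses a Gaussian-kernel Nadaraya-Watson smoother and leave-one-out cross-validation. The simulated sine-curve data, the bandwidth grid, and the names `kernel_smooth` and `loocv_mse` are assumptions chosen purely for illustration.

```python
import math
import random

def kernel_smooth(x0, xs, ys, h, skip=None):
    """Nadaraya-Watson estimate at x0: Gaussian-kernel-weighted average of ys.

    `skip` is the index of one training point to leave out (for cross-validation).
    """
    num = den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((x - x0) / h) ** 2)
        num += w * y
        den += w
    # If every weight underflows to zero, fall back on the global mean.
    return num / den if den > 0 else sum(ys) / len(ys)

def loocv_mse(xs, ys, h):
    """Leave-one-out cross-validated mean squared error for bandwidth h."""
    errs = [(ys[i] - kernel_smooth(xs[i], xs, ys, h, skip=i)) ** 2
            for i in range(len(xs))]
    return sum(errs) / len(errs)

# Hypothetical data: a noisy sine curve on [0, 1].
rng = random.Random(1)
xs = [rng.uniform(0, 1) for _ in range(150)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]

# Score a grid of bandwidths and keep the one with the smallest CV error.
bandwidths = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
scores = {h: loocv_mse(xs, ys, h) for h in bandwidths}
best_h = min(scores, key=scores.get)
for h in bandwidths:
    print(f"h = {h:<5} LOOCV MSE = {scores[h]:.4f}")
print("selected bandwidth:", best_h)
```

Very small bandwidths chase the noise and very large ones flatten the curve, so the cross-validated error should be smallest somewhere in between.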

Readings: Notes, chapter 4 (R); Faraway, section 11.1

Optional readings: Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2012 10:30 | permanent link

Advantages of Backwardness (Advanced Data Analysis from an Elementary Point of View)

In which we try to discern whether poor countries grow faster.

Assignment, R, penn-select.csv data set

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2012 09:30 | permanent link

January 24, 2012

Model Evaluation: Error and Inference (Advanced Data Analysis from an Elementary Point of View)

Goals of statistical analysis: summaries, prediction, scientific inference. Evaluating predictions: in-sample error, generalization error; over-fitting. Cross-validation for estimating generalization error and for model selection. Justifying model-based inferences; Luther and Süleyman.

Reading: Notes, chapter 3 (R for examples and figures).

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 24, 2012 10:30 | permanent link

The Truth About Linear Regression (Advanced Data Analysis from an Elementary Point of View)

Multiple linear regression: general formula for the optimal linear predictor. Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why R^2 is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable problems). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What "controlled for in a linear regression" really means.
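
A small simulation can illustrate the point that linear regression coefficients change with the distribution of the input. With one predictor, the slope of the optimal linear predictor is Cov[X, Y] / Var[X]. For Y = X^2 plus noise (an example made up here, not taken from the notes), that slope works out to 1 when X is uniform on [0, 1] and to 3 when X is uniform on [1, 2]. The function names and sample sizes below are likewise illustrative.

```python
import random

def ols_slope(xs, ys):
    """Slope of the optimal linear predictor: Cov[X, Y] / Var[X]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var = sum((x - mx) ** 2 for x in xs) / n
    return cov / var

def simulate(lo, hi, n=5000, seed=0):
    """Draw X uniformly on [lo, hi] and set Y = X^2 + small Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

# Same curve, two different input distributions, two different slopes.
slope_low = ols_slope(*simulate(0.0, 1.0))
slope_high = ols_slope(*simulate(1.0, 2.0))
print(f"slope with X on [0, 1]: {slope_low:.3f}")
print(f"slope with X on [1, 2]: {slope_high:.3f}")
```

Nothing about the relationship between X and Y differs between the two runs; only the range of X does.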

Reading: Notes, chapter 2 (R for examples and figures); Faraway, chapter 1 (continued).

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 24, 2012 10:15 | permanent link

January 22, 2012

Dungeons and Debtors

Attention conservation notice: A silly idea about gamifying credit cards, which would be evil if it worked.

To make a profit in an otherwise competitive industry, it helps if you can impose switching costs on your customers, making them either pay to stop doing business with you, or give up something of value to them. There are whole books about this, written by respected economists1.

This is why credit card companies are happy to offer rewards for use: accumulating points on a card, which would not move with you if you got a new card and transferred the balance, is an attempt to create switching costs. Unfortunately, from the point of view of the banks, people will redeem their points from time to time, so some money must be spent on the rewards. The ideal would be points which people would value but which would never cost the bank anything.

Item: Computer games are, deliberately, addictive. Social games are especially addictive.

Accordingly, if I were an evil and unscrupulous credit card company (but I repeat myself), I would create an online game, where people could get points either from playing the game, or from spending money with my credit card. For legal reasons, I think it would probably be best to allow the game to technically be open to everyone, but with a registration fee which is, naturally, waived for card-holders. Of course, the game software would be set up to announce on Facebook (etc.) whenever the player/debtor leveled up. I would also be tempted to award double points for fees, and triple for interest charges, but one could experiment with this. If they close their credit card account, they have to start the game over from the beginning.

The fact that online acquaintances can't tell whether the debtor is advancing through spending or through game-play helps keep the reward points worth having. It's true that the credit card company has to pay for the game's design (a one-time start-up cost) and the game servers, but these are fairly cheap, and the bank never has to cash out points in actual dollars or goods. The debtors themselves do all the work of investing the points with meaning and value. They impose the switching costs on themselves.

My plan is sheer elegance in its simplicity, and I will be speaking to an attorney about a business method patent first thing Monday.

1: Much can be learned about our benevolent new-media overlords from the fact that this book carries a blurb from Jeff Bezos of Amazon, and that Varian now works for Google.

Modest Proposals

Posted by crshalizi at January 22, 2012 10:15 | permanent link

January 17, 2012

"Can't seem to face up to the facts"

Attention conservation notice: An academic paper you've never heard of, about a distressing subject, has bad statistics and is generally foolish.

Because my so-called friends like to torment me, several of them made sure that I knew a remarkably idiotic paper about power laws was making the rounds, promoted by the ignorant and credulous, with assistance from the credulous and ignorant, supported by capitalist tools:

M. V. Simkin and V. P. Roychowdhury, "Stochastic modeling of a serial killer", arxiv:1201.2458
Abstract: We analyze the time pattern of the activity of a serial killer, who during twelve years had murdered 53 people. The plot of the cumulative number of murders as a function of time is of "Devil's staircase" type. The distribution of the intervals between murders (step length) follows a power law with the exponent of 1.4. We propose a model according to which the serial killer commits murders when neuronal excitation in his brain exceeds certain threshold. We model this neural activity as a branching process, which in turn is approximated by a random walk. As the distribution of the random walk return times is a power law with the exponent 1.5, the distribution of the inter-murder intervals is thus explained. We confirm analytical results by numerical simulation.

Let's see if we can't stop this before it gets too far, shall we? The serial killer in question is one Andrei Chikatilo, and that Wikipedia article gives the dates of death of his victims, which seems to have been Simkin and Roychowdhury's data source as well. Several of these are known only imprecisely, so I made guesses within the known ranges; the results don't seem to be very sensitive to the guesses. Simkin and Roychowdhury plotted the distribution of days between killings in a binned histogram on a logarithmic scale; as we've explained elsewhere, this is a bad idea, which destroys information to no good purpose, and a better display shows the (upper or complementary) cumulative distribution function1, which looks like so:

When I fit a power law to this by maximum likelihood, I get an exponent of 1.4, like Simkin and Roychowdhury; that looks like this:

Update: The 95% (bootstrap) confidence interval for the exponent is (1.35,1.48), which you will notice excludes 1.5.

On the other hand, when I fit a log-normal (because Gauss is not mocked), we get this:

After that figure, a formal statistical test is almost superfluous, but let's do it anyway, because why just trust our eyes when we can calculate? The data are better fit by the log-normal than by the power-law (the data are e^10.41, or about 33 thousand, times more likely under the former than the latter), but that could happen via mere chance fluctuations, even when the power law is right. Vuong's model comparison test lets us quantify that probability, and tells us a power-law would produce data which seems to fit a log-normal this well no more than 0.4 percent2 of the time. Not only does the log-normal distribution fit better than the power-law, the difference is so big that it would be absurd to try to explain it away as bad luck. In absolute terms, we can find the probability of getting as big a deviation between the fitted power law and the observed distribution through sampling fluctuations, and it's about 0.03 percent2b. [R code for figures, estimates and test, including data.]

Since Simkin and Roychowdhury's model produces a power law, and these data, whatever else one might say about them, are not power-law distributed, I will refrain from discussing all the ways in which it is a bad model. I will re-iterate that it is an idiotic paper — which is different from saying that Simkin and Roychowdhury are idiots; they are not and have done interesting work on, e.g., estimating how often references are copied from bibliographies without being read by tracking citation errors4. But the idiocy in this paper goes beyond statistical incompetence. The model used here was originally proposed for the time intervals between epileptic fits. The authors realize that

[i]t may seem unreasonable to use the same model to describe an epileptic and a serial killer. However, Lombroso [5] long ago pointed out a link between epilepsy and criminality.
That would be the 19th-century pseudo-scientist3 Cesare Lombroso, who also thought he could identify criminals from the shape of their skulls; for "pointed out", read "made up". Like I said: idiocy.

As for the general issues about power laws and their abuse, say something once, why say it again?

Update 9 pm that day: Added the goodness-of-fit test (text before note 2b, plus that note), updated code, added PNG versions of figures, added attention conservation notice.
21 January: typo fixes (missing pronoun, mis-placed decimal point), added bootstrap confidence interval for exponent, updated code accordingly.

Manual trackback: Hacker News (do I really need to link to this?), Naked Capitalism (?!); Mathbabe; Wolfgang Beirl; Ars Mathematica (yes, I am that predictable)

1: This is often called the "survival function", but that seems inappropriate here.

2: On average, the log-likelihood of each observation was 0.20 higher under the log-normal than under the power law, and the standard deviation of the log likelihood ratio over the samples was only 0.54. The test statistic thus comes out to -2.68, and the one-sided p-value to 0.36%.
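
The arithmetic behind this note can be checked with a few lines of Python. This is only a back-of-the-envelope sketch from the rounded figures quoted in the post, not the linked R code. It assumes 52 observations (53 murders leave 52 intervals between them), which is not stated explicitly above, and because the inputs are rounded it lands near, not exactly on, -2.68 and 0.36%.

```python
import math

n = 52             # assumption: 53 murders give 52 inter-murder intervals
total_llr = 10.41  # total log-likelihood gain of log-normal over power law (from the post)
sd_llr = 0.54      # per-observation SD of the log likelihood ratio (from this note)

# Average gain per observation; should be close to the 0.20 quoted above.
mean_llr = total_llr / n

# Vuong statistic, signed as power law minus log-normal, hence negative.
z = -mean_llr * math.sqrt(n) / sd_llr

# One-sided p-value: standard normal CDF evaluated at z.
p_one_sided = 0.5 * math.erfc(-z / math.sqrt(2))

# Likelihood ratio implied by the total log-likelihood difference.
likelihood_ratio = math.exp(total_llr)

print(f"mean log-likelihood ratio per observation: {mean_llr:.3f}")
print(f"Vuong statistic: {z:.2f}")
print(f"one-sided p-value: {100 * p_one_sided:.2f}%")
print(f"likelihood ratio: {likelihood_ratio:.0f}")
```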

2b: Use a Kolmogorov-Smirnov test. Since the power law has a parameter estimated from data (namely, the exponent), we can't just plug in to the usual tables for a K-S test, but we can find a p-value by simulating the power law (as in my paper with Aaron and Mark), and when I do that, with a hundred thousand replications, the p-value is about 3*10^-4.
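
The simulation-based K-S procedure in this note can be sketched as follows. This is a simplified Python illustration, not the linked R code. It assumes a continuous power law with a known lower cut-off `x_min`, whereas the real data are whole numbers of days. It uses far fewer replications than a hundred thousand, and it runs on made-up data (exponentials of half-normal draws) instead of the Chikatilo intervals. All function names are hypothetical.

```python
import math
import random

def fit_alpha(xs, x_min):
    """Maximum-likelihood exponent for a continuous power law on x >= x_min."""
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)

def ks_distance(xs, alpha, x_min):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        d = max(d, abs(cdf - i / n), abs(cdf - (i + 1) / n))
    return d

def draw_power_law(n, alpha, x_min, rng):
    """Inverse-transform sampling from the continuous power law."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

def bootstrap_ks_pvalue(xs, x_min, n_sim=200, seed=0):
    """Re-fit the exponent on every simulated sample, as on the real data."""
    rng = random.Random(seed)
    alpha_hat = fit_alpha(xs, x_min)
    d_obs = ks_distance(xs, alpha_hat, x_min)
    exceed = 0
    for _ in range(n_sim):
        sim = draw_power_law(len(xs), alpha_hat, x_min, rng)
        if ks_distance(sim, fit_alpha(sim, x_min), x_min) >= d_obs:
            exceed += 1
    return alpha_hat, d_obs, exceed / n_sim

# Hypothetical non-power-law data, all values >= 1.
rng = random.Random(7)
data = [math.exp(abs(rng.gauss(0.0, 1.0))) for _ in range(1000)]

alpha_hat, d_obs, p_value = bootstrap_ks_pvalue(data, x_min=1.0)
print(f"fitted exponent = {alpha_hat:.2f}, KS distance = {d_obs:.3f}, "
      f"p-value = {p_value:.3f}")
```

Re-estimating the exponent inside the loop is what accounts for the parameter having been fitted to the data. Comparing against the standard K-S table instead would overstate how well the power law fits.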

3: There are in fact subtle, not to say profound, issues in the sociology and philosophy of science here: was Lombroso always a pseudo-scientist, because his investigations never came up to any acceptable standard of reliable inquiry? Or just because they didn't come up to the standards of inquiry prevalent at the time he wrote? Or did Lombroso become a pseudo-scientist, when enough members of enough intellectual communities woke up from the pleasure of having their prejudices about the lower orders echoed to realize that he was full of it? However that may be, this paper has the dubious privilege of being the first time I have ever seen Lombroso cited as an authority rather than a specimen.

4: Actually, for several years my bibliography data base had the wrong page numbers for one of my own papers, due to a typo, so their method would flag some of my subsequent works as written by someone who had cited that paper without reading it, which I assure you was not the case. But the idea seems reasonable in general.

Power Laws; Learned Folly

Posted by crshalizi at January 17, 2012 20:23 | permanent link

What's That Got to Do with the Price of Condos in California? (Advanced Data Analysis from an Elementary Point of View)

In which we practice the art of linear regression upon the California real-estate market, by way of warming up for harder tasks.

Assignment, data set

(Yes, the data set is now about as old as my students, but last week in Austin I was too busy drinking on 6th street having lofty conversations about the future of statistics to update the file with the UScensus2000 package.)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 17, 2012 10:31 | permanent link

Regression: Predicting and Relating Quantitative Features (Advanced Data Analysis from an Elementary Point of View)

Statistics is the science which studies methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.

Readings: Notes, chapter 1; Faraway, chapter 1, through page 17.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 17, 2012 10:30 | permanent link

January 07, 2012

Mail Woes

If you sent me e-mail at my @stat.cmu.edu address in the last few days, I haven't gotten it, and may never get it. The address firstinitiallastname at cmu dot edu now points somewhere where I can read.

Posted by crshalizi at January 07, 2012 20:40 | permanent link

January 06, 2012

Sloth in Austin

I'll be speaking at UT-Austin next week, through the kindness of the division of statistics and scientific computation:

"When Can We Learn Network Models from Samples?"
Abstract: Statistical models of network structure are models for the entire network, but the data are typically just a sampled sub-network. Parameters for the whole network, which are what we care about, are estimated by fitting the model on the sub-network. This assumes that the model is "consistent under sampling" (forms a projective family). For the widely-used exponential random graph models (ERGMs), this trivial-looking condition is violated by many popular and scientifically appealing models; satisfying it drastically limits ERGMs' expressive power. These results are special cases of more general ones about exponential families of dependent variables, which we also prove. As a consolation prize, we offer easily checked conditions for the consistency of maximum likelihood estimation in ERGMs, and discuss some possible constructive responses.
Time and place: 2--3 pm on Wednesday, 11 January 2012, in Hogg Building (WCH), room 1.108

This will of course be based on my paper with Alessandro, but since I understand some non-statisticians may sneak in, I'll try to be more comprehensible and less technical.

Since this will be my first time in Austin (indeed my first time in Texas), and I have (for a wonder) absolutely no obligations on the 12th, suggestions on what I should see or do would be appreciated.

Self-Centered

Posted by crshalizi at January 06, 2012 14:15 | permanent link

January 03, 2012

Course Announcement: Advanced Data Analysis from an Elementary Point of View

It's that time again:

36-402, Advanced Data Analysis, Spring 2012
Description: This course introduces modern methods of data analysis, building on the theory and application of linear models from 36-401. Topics include nonlinear regression, nonparametric smoothing, density estimation, generalized linear and generalized additive models, simulation and predictive model-checking, cross-validation, bootstrap uncertainty estimation, multivariate methods including factor analysis and mixture models, and graphical models and causal inference. Students will analyze real-world data from a range of fields, coding small programs and writing reports.
Prerequisites: 36-401 (modern regression); or consent of instructor, in extraordinary cases
Time and place: 10:30--11:50 am, Tuesdays and Thursdays, in Porter Hall 100
Note: Graduate students in other departments wishing to take this course for credit need consent of the instructor, and should register for 36-608.

Fuller details on the class homepage, including a detailed (but subject to change) list of topics, and links to the compiled course notes. I'll post updates here to the notes for specific lectures and assignments, like last time.

This is the same course I taught last spring, only grown from sixty-odd students to (currently) ninety-three (from 12 different majors!). The smart thing for me to do would probably be to change nothing (I haven't gotten to re-teach a class since 2009), but I felt the urge to re-organize the material and squeeze in a few more topics.

The biggest change I am making is introducing some quality-control sampling. The course is too big for me to look over much of the students' work, and even then, that gives me little sense of whether the assignments are really probing what they know (much less helping them learn). So I will be randomly selecting six students every week, to come to my office and spend 10--15 minutes each explaining the assignment to me and answering live questions about it. Even allowing for students being randomly selected multiple times*, I hope this will give me a reasonable cross-section of how well the assignments are working, and how well the grading tracks that. But it's an experiment and we'll see how it goes.

* (exercise for the student): Find the probability distribution of the number of times any given student gets selected. Assume 93 students, with 6 students selected per week, and 14 weeks. (Also assume no one drops the class.) Find the distribution of the total number of distinct students who ever get selected.
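
One way to check an answer to the exercise numerically is sketched below in Python. It assumes that each week's six students are drawn without replacement and that the draws in different weeks are independent, which the post does not spell out. Under those assumptions the number of times a given student is picked is binomial. The number of distinct students can be tracked as a Markov chain whose weekly increments are hypergeometric.

```python
from math import comb

N, K, WEEKS = 93, 6, 14  # students, picks per week, weeks (from the exercise)

# Times a given student is selected: Binomial(WEEKS, K / N).
p = K / N
times_selected = [comb(WEEKS, k) * p ** k * (1 - p) ** (WEEKS - k)
                  for k in range(WEEKS + 1)]

# Distinct students ever selected: Markov chain on the number seen so far.
dist = {0: 1.0}
for _ in range(WEEKS):
    nxt = {}
    for seen, prob in dist.items():
        for new in range(K + 1):
            # Hypergeometric: `new` of this week's K picks were never seen before.
            w = comb(N - seen, new) * comb(seen, K - new) / comb(N, K)
            if w > 0:
                nxt[seen + new] = nxt.get(seen + new, 0.0) + prob * w
    dist = nxt

expected_distinct = sum(k * q for k, q in dist.items())

print(f"P(a given student is never selected) = {times_selected[0]:.4f}")
print(f"expected number of distinct students = {expected_distinct:.2f}")
```

As a sanity check, the expected number of distinct students should agree with N * (1 - (1 - K/N)^WEEKS), by linearity of expectation.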

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 03, 2012 23:00 | permanent link

January 01, 2012

End of Year Inventory, 2011

Attention conservation notice: Navel-gazing.

Paper manuscripts completed: 12
Papers accepted: 2 [i, ii], one from last year
Papers rejected: 10 (fools! I'll show you all!)
Papers rejected with a comment from the editor that no one should take the paper I was responding to, published in the same glossy high-impact journal, "literally": 1
Papers in refereeing limbo: 4
Papers in progress: I won't look in that directory and you can't make me

Grant proposals submitted: 3
Grant proposals rejected: 4 (two from last year)
Grant proposals in refereeing limbo: 1
Grant proposals in progress for next year: 3

Talks given and conferences attended: 20, in 14 cities

Manuscripts refereed: 46, for 18 different journals and conferences
Manuscripts waiting for me to referee: 7
Manuscripts for which I was the responsible associate editor at Annals of Applied Statistics: 10
Book proposals reviewed: 3

Classes taught: 2
New classes taught: 2
Summer school classes taught: 1
New summer school classes taught: 1
Pages of new course material written: about 350

Students who are now ABD: 1
Students who are not just ABD but on the job market: 1

Letters of recommendation written: 8 (with about 100 separate destinations)

Promotion packets submitted: 1 (for promotion to associate professor, but without tenure)
Promotion cases still working through the system: 1

Book reviews published on dead trees: 2 [i, ii]
Non-book-reviews published on dead trees: 1

Weblog posts: 157
Substantive weblog posts: 54, counting algal growths

Books acquired: 298
E-book readers gratefully received: 1
Books driven by my mother from her house to Pittsburgh: about 800
Books begun: 254
Books finished: 204 (of which 34 on said e-book reader)
Books given up: 16
Books sold: 133
Books donated: 113

Book manuscripts completed: 0

Wisdom teeth removed: 4
Unwise teeth removed: 1

Major life transitions: 0

Self-Centered

Posted by crshalizi at January 01, 2012 12:00 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems