January 27, 2012

Changing How Changes Change (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about covariance matrices and (2) will be in Pittsburgh on Monday.

Since so much of multivariate statistics depends on patterns of correlation among variables, it is a bit awkward to have to admit that in lots of practical contexts, correlation matrices are just not very stable, and can change quite drastically. (Some people pay a lot to rediscover this.) It turns out that there are more constructive responses to this situation than throwing up one's hands and saying "that sucks", and on Monday a friend of the department and general brilliant-type-person will be kind enough to tell us about them:

Emily Fox, "Bayesian Covariance Regression and Autoregression"
Abstract: Many inferential tasks, such as analyzing the functional connectivity of the brain via coactivation patterns or capturing the changing correlations amongst a set of assets for portfolio optimization, rely on modeling a covariance matrix whose elements evolve as a function of time. A number of multivariate heteroscedastic time series models have been proposed within the econometrics literature, but are typically limited by lack of clear margins, computational intractability, and curse of dimensionality. In this talk, we first introduce and explore a new class of time series models for covariance matrices based on a constructive definition exploiting inverse Wishart distribution theory. The construction yields a stationary, first-order autoregressive (AR) process on the cone of positive semi-definite matrices.
We then turn our focus to more general predictor spaces and scaling to high-dimensional datasets. Here, the predictor space could represent not only time, but also space or other factors. Our proposed Bayesian nonparametric covariance regression framework harnesses a latent factor model representation. In particular, the predictor-dependent factor loadings are characterized as a sparse combination of a collection of unknown dictionary functions (e.g., Gaussian process random functions). The induced predictor-dependent covariance is then a regularized quadratic function of these dictionary elements. Our proposed framework leads to a highly-flexible, but computationally tractable formulation with simple conjugate posterior updates that can readily handle missing data. Theoretical properties are discussed and the methods are illustrated through an application to the Google Flu Trends data and the task of word classification based on single-trial MEG data.
Time and place: 4--5 pm on Monday, 30 January 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at January 27, 2012 14:25 | permanent link

January 26, 2012

Smoothing Methods in Regression (Advanced Data Analysis from an Elementary Point of View)

The constructive alternative to complaining about linear regression is non-parametric regression. There are many ways to do this, but we will focus on the conceptually simplest one, which is smoothing; especially kernel smoothing. All smoothers involve local averaging of the training data. The bias-variance trade-off tells us that there is an optimal amount of smoothing, which depends both on how rough the true regression curve is, and on how much data we have; we should smooth less as we get more information about the true curve. Knowing the truly optimal amount of smoothing is impossible, but we can use cross-validation to select a good degree of smoothing, and adapt to the unknown roughness of the true curve. Detailed examples. Analysis of how quickly kernel regression converges on the truth. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.
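
A minimal sketch of the idea in the summary above, not the course's own code (which is in R): a Gaussian-kernel smoother (the Nadaraya-Watson estimator) whose bandwidth is picked by leave-one-out cross-validation. The simulated sine curve, the noise level, and the candidate bandwidths are all assumptions made up for illustration.

```python
import math
import random


def kernel_smooth(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0: a kernel-weighted local average of ys."""
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed; fall back to the nearest point
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
        return ys[nearest]
    return sum(w * y for w, y in zip(weights, ys)) / total


def loo_cv_error(xs, ys, h):
    """Leave-one-out cross-validated mean squared error for bandwidth h."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        xs_rest = xs[:i] + xs[i + 1:]
        ys_rest = ys[:i] + ys[i + 1:]
        total += (ys[i] - kernel_smooth(xs[i], xs_rest, ys_rest, h)) ** 2
    return total / n


def pick_bandwidth(xs, ys, candidates):
    """Return the candidate bandwidth with the smallest cross-validated error."""
    return min(candidates, key=lambda h: loo_cv_error(xs, ys, h))


# Simulated data (an assumption for illustration): a wiggly curve plus noise.
random.seed(1)
xs = [random.uniform(0.0, 3.0) for _ in range(150)]
ys = [math.sin(4.0 * x) + random.gauss(0.0, 0.3) for x in xs]
candidates = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
best_h = pick_bandwidth(xs, ys, candidates)
print("bandwidth chosen by cross-validation:", best_h)
```

Re-running with a larger sample should tend to favor smaller bandwidths, which is the "smooth less as we get more information" point.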

Readings: Notes, chapter 4 (R); Faraway, section 11.1

Optional readings: Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2012 10:30 | permanent link

Advantages of Backwardness (Advanced Data Analysis from an Elementary Point of View)

In which we try to discern whether poor countries grow faster.

Assignment, R, penn-select.csv data set

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2012 09:30 | permanent link

January 24, 2012

Model Evaluation: Error and Inference (Advanced Data Analysis from an Elementary Point of View)

Goals of statistical analysis: summaries, prediction, scientific inference. Evaluating predictions: in-sample error, generalization error; over-fitting. Cross-validation for estimating generalization error and for model selection. Justifying model-based inferences; Luther and Süleyman.
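
To make the contrast between in-sample error and generalization error concrete, here is a small sketch under invented assumptions: a piecewise-constant regression on simulated data, where the number of bins plays the role of model complexity. The helper names (`fit_bins`, `cv_error`, and so on) and the data-generating curve are hypothetical and are not taken from the notes.

```python
import math
import random


def fit_bins(xs, ys, m):
    """Fit a piecewise-constant regression on [0, 1) with m equal-width bins."""
    sums = [0.0] * m
    counts = [0] * m
    for x, y in zip(xs, ys):
        b = min(int(x * m), m - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Empty bins fall back to the overall mean of the training responses.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def predict(means, x):
    """Look up the fitted bin mean for input x."""
    m = len(means)
    return means[min(int(x * m), m - 1)]


def in_sample_error(xs, ys, m):
    """Mean squared error on the same data used for fitting."""
    means = fit_bins(xs, ys, m)
    return sum((y - predict(means, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cv_error(xs, ys, m, folds=5):
    """k-fold cross-validation estimate of the generalization error."""
    n = len(xs)
    total = 0.0
    for f in range(folds):
        held_out = set(range(f, n, folds))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        means = fit_bins(train_x, train_y, m)
        total += sum((ys[i] - predict(means, xs[i])) ** 2 for i in held_out)
    return total / n


# Simulated data (an assumption for illustration).
random.seed(2)
xs = [random.random() for _ in range(200)]
ys = [math.sin(2.0 * math.pi * x) + random.gauss(0.0, 0.5) for x in xs]
bin_counts = [1, 2, 4, 8, 16, 32, 64]
for m in bin_counts:
    print(m, round(in_sample_error(xs, ys, m), 3), round(cv_error(xs, ys, m), 3))
best_m = min(bin_counts, key=lambda m: cv_error(xs, ys, m))
print("bins chosen by cross-validation:", best_m)
```

Because each bin count here is a refinement of the previous one, the in-sample error can only go down as bins are added, which is why it is useless for choosing among them; the cross-validated error has no such guarantee.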

Reading: Notes, chapter 3 (R for examples and figures).

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 24, 2012 10:30 | permanent link

The Truth About Linear Regression (Advanced Data Analysis from an Elementary Point of View)

Multiple linear regression: general formula for the optimal linear predictor. Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why R^2 is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable problems). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What "controlled for in a linear regression" really means.
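
One way to see that linear regression coefficients change with the distribution of the input: take a toy case (an assumption chosen for illustration, not necessarily the example used in the notes) where the truth is Y = X^2 with no noise and X is uniform on an interval. The optimal linear predictor has slope Cov(X, Y)/Var(X) and intercept E[Y] - slope * E[X], and both can be computed exactly from the moments of X.

```python
from fractions import Fraction


def uniform_moment(a, b, k):
    """E[X^k] for X uniform on [a, b], computed exactly."""
    a, b = Fraction(a), Fraction(b)
    return (b ** (k + 1) - a ** (k + 1)) / ((k + 1) * (b - a))


def optimal_linear_predictor(a, b):
    """Intercept and slope of the best linear predictor of Y = X^2 from X."""
    ex, ex2, ex3 = (uniform_moment(a, b, k) for k in (1, 2, 3))
    slope = (ex3 - ex * ex2) / (ex2 - ex ** 2)  # Cov(X, X^2) / Var(X)
    intercept = ex2 - slope * ex                # E[Y] - slope * E[X]
    return intercept, slope


for interval in [(0, 1), (1, 2)]:
    intercept, slope = optimal_linear_predictor(*interval)
    print(interval, "intercept:", intercept, "slope:", slope)
```

The same curve gives a slope of 1 when X is uniform on [0, 1] and a slope of 3 when X is uniform on [1, 2], so the "coefficient of X" here is a fact about the curve and the input distribution together.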

Reading: Notes, chapter 2 (R for examples and figures); Faraway, chapter 1 (continued).

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 24, 2012 10:15 | permanent link

January 22, 2012

Dungeons and Debtors

Attention conservation notice: A silly idea about gamifying credit cards, which would be evil if it worked.

To make a profit in an otherwise competitive industry, it helps if you can impose switching costs on your customers, making them either pay to stop doing business with you, or give up something of value to them. There are whole books about this, written by respected economists1.

This is why credit card companies are happy to offer rewards for use: accumulating points on a card, which would not move with you if you got a new card and transferred the balance, is an attempt to create switching costs. Unfortunately, from the point of view of the banks, people will redeem their points from time to time, so some money must be spent on the rewards. The ideal would be points which people would value but which would never cost the bank anything.

Item: Computer games are, deliberately, addictive. Social games are especially addictive.

Accordingly, if I were an evil and unscrupulous credit card company (but I repeat myself), I would create an online game, where people could get points either from playing the game, or from spending money with my credit card. For legal reasons, I think it would probably be best to allow the game to technically be open to everyone, but with a registration fee which is, naturally, waived for card-holders. Of course, the game software would be set up to announce on Facebook (etc.) whenever the player/debtor leveled up. I would also be tempted to award double points for fees, and triple for interest charges, but one could experiment with this. If they close their credit card account, they have to start the game over from the beginning.

The fact that online acquaintances can't tell whether the debtor is advancing through spending or through game-play helps keep the reward points worth having. It's true that the credit card company has to pay for the game's design (a one-time start-up cost) and the game servers, but these are fairly cheap, and the bank never has to cash out points in actual dollars or goods. The debtors themselves do all the work of investing the points with meaning and value. They impose the switching costs on themselves.

My plan is sheer elegance in its simplicity, and I will be speaking to an attorney about a business method patent first thing Monday.

1: Much can be learned about our benevolent new-media overlords from the fact that this book carries a blurb from Jeff Bezos of Amazon, and that Varian now works for Google.

Modest Proposals

Posted by crshalizi at January 22, 2012 10:15 | permanent link

January 17, 2012

"Can't seem to face up to the facts"

Attention conservation notice: An academic paper you've never heard of, about a distressing subject, had bad statistics and is generally foolish.

Because my so-called friends like to torment me, several of them made sure that I knew a remarkably idiotic paper about power laws was making the rounds, promoted by the ignorant and credulous, with assistance from the credulous and ignorant, supported by capitalist tools:

M. V. Simkin and V. P. Roychowdhury, "Stochastic modeling of a serial killer", arxiv:1201.2458
Abstract: We analyze the time pattern of the activity of a serial killer, who during twelve years had murdered 53 people. The plot of the cumulative number of murders as a function of time is of "Devil's staircase" type. The distribution of the intervals between murders (step length) follows a power law with the exponent of 1.4. We propose a model according to which the serial killer commits murders when neuronal excitation in his brain exceeds certain threshold. We model this neural activity as a branching process, which in turn is approximated by a random walk. As the distribution of the random walk return times is a power law with the exponent 1.5, the distribution of the inter-murder intervals is thus explained. We confirm analytical results by numerical simulation.

Let's see if we can't stop this before it gets too far, shall we? The serial killer in question is one Andrei Chikatilo, and that Wikipedia article gives the dates of death of his victims, which seems to have been Simkin and Roychowdhury's data source as well. Several of these are known only imprecisely, so I made guesses within the known ranges; the results don't seem to be very sensitive to the guesses. Simkin and Roychowdhury plotted the distribution of days between killings in a binned histogram on a logarithmic scale; as we've explained elsewhere, this is a bad idea, which destroys information to no good purpose, and a better display shows the (upper or complementary) cumulative distribution function1, which looks like so:

When I fit a power law to this by maximum likelihood, I get an exponent of 1.4, like Simkin and Roychowdhury; that looks like this:

Update: The 95% (bootstrap) confidence interval for the exponent is (1.35,1.48), which you will notice excludes 1.5.
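
For readers who want the mechanics, here is a hedged sketch of a maximum-likelihood power-law fit with a percentile bootstrap interval for the exponent. It is not the linked R code: it assumes a continuous power law above a known lower cutoff `xmin`, uses simulated data rather than the actual inter-murder intervals, and all function names are made up for illustration.

```python
import math
import random


def fit_power_law(xs, xmin):
    """MLE of alpha for a density proportional to x^(-alpha) on x >= xmin."""
    return 1.0 + len(xs) / sum(math.log(x / xmin) for x in xs)


def sample_power_law(n, alpha, xmin, rng):
    """Draw n values from the continuous power law by inverting its CDF."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def bootstrap_ci(xs, xmin, rng, reps=1000, level=0.95):
    """Percentile bootstrap interval: resample the data, refit, take quantiles."""
    estimates = sorted(
        fit_power_law([rng.choice(xs) for _ in xs], xmin) for _ in range(reps)
    )
    lo = estimates[int((1.0 - level) / 2.0 * reps)]
    hi = estimates[int((1.0 + level) / 2.0 * reps) - 1]
    return lo, hi


# Simulated data with a known exponent (an assumption for illustration).
rng = random.Random(3)
xmin = 1.0
data = sample_power_law(500, 1.4, xmin, rng)
alpha_hat = fit_power_law(data, xmin)
ci_low, ci_high = bootstrap_ci(data, xmin, rng)
print("estimate:", round(alpha_hat, 3), "95% interval:", round(ci_low, 3), round(ci_high, 3))
```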

On the other hand, when I fit a log-normal (because Gauss is not mocked), we get this:

After that figure, a formal statistical test is almost superfluous, but let's do it anyway, because why just trust our eyes when we can calculate? The data are better fit by the log-normal than by the power-law (the data are e^10.41 or about 33 thousand times more likely under the former than the latter), but that could happen via mere chance fluctuations, even when the power law is right. Vuong's model comparison test lets us quantify that probability, and tells us a power-law would produce data which seems to fit a log-normal this well no more than 0.4 percent2 of the time. Not only does the log-normal distribution fit better than the power-law, the difference is so big that it would be absurd to try to explain it away as bad luck. In absolute terms, we can find the probability of getting as big a deviation between the fitted power law and the observed distribution through sampling fluctuations, and it's about 0.03 percent2b. [R code for figures, estimates and test, including data.]

Since Simkin and Roychowdhury's model produces a power law, and these data, whatever else one might say about them, are not power-law distributed, I will refrain from discussing all the ways in which it is a bad model. I will re-iterate that it is an idiotic paper — which is different from saying that Simkin and Roychowdhury are idiots; they are not and have done interesting work on, e.g., estimating how often references are copied from bibliographies without being read by tracking citation errors4. But the idiocy in this paper goes beyond statistical incompetence. The model used here was originally proposed for the time intervals between epileptic fits. The authors realize that

[i]t may seem unreasonable to use the same model to describe an epileptic and a serial killer. However, Lombroso [5] long ago pointed out a link between epilepsy and criminality.
That would be the 19th-century pseudo-scientist3 Cesare Lombroso, who also thought he could identify criminals from the shape of their skulls; for "pointed out", read "made up". Like I said: idiocy.

As for the general issues about power laws and their abuse, say something once, why say it again?

Update 9 pm that day: Added the goodness-of-fit test (text before note 2b, plus that note), updated code, added PNG versions of figures, added attention conservation notice.
21 January: typo fixes (missing pronoun, mis-placed decimal point), added bootstrap confidence interval for exponent, updated code accordingly.

Manual trackback: Hacker News (do I really need to link to this?), Naked Capitalism (?!); Mathbabe; Wolfgang Beirl; Ars Mathematica (yes, I am that predictable)

1: This is often called the "survival function", but that seems inappropriate here.

2: On average, the log-likelihood of each observation was 0.20 higher under the log-normal than under the power law, and the standard deviation of the log likelihood ratio over the samples was only 0.54. The test statistic thus comes out to -2.68, and the one-sided p-value to 0.36%.
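
The arithmetic in note 2 can be checked in a few lines. This sketch assumes n = 52 observations, on the grounds that 53 victims give 52 intervals between killings (and 52 * 0.20 = 10.4, in line with the log-likelihood ratio of 10.41 quoted in the text); the count is an inference, not something the post states. The inputs are the rounded values from the note, so the output matches the quoted -2.68 and 0.36% only up to rounding.

```python
import math


def vuong_statistic(mean_llr, sd_llr, n):
    """Normalized log-likelihood ratio: sqrt(n) * mean / standard deviation."""
    return math.sqrt(n) * mean_llr / sd_llr


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


n = 52             # assumed: 53 victims give 52 gaps between killings
mean_llr = -0.20   # power law minus log-normal, per observation (rounded)
sd_llr = 0.54      # standard deviation of the per-observation ratio (rounded)

stat = vuong_statistic(mean_llr, sd_llr, n)
p_value = normal_cdf(stat)              # one-sided: how often a power law does this badly
likelihood_ratio = math.exp(10.41)      # total log-likelihood ratio from the text

print("test statistic:", round(stat, 2))
print("one-sided p-value:", round(p_value, 4))
print("likelihood ratio:", round(likelihood_ratio))
```

With these rounded inputs the statistic comes out near -2.67 and the p-value just under 0.4 percent.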

2b: Use a Kolmogorov-Smirnov test. Since the power law has a parameter estimated from data (namely, the exponent), we can't just plug in to the usual tables for a K-S test, but we can find a p-value by simulating the power law (as in my paper with Aaron and Mark), and when I do that, with a hundred thousand replications, the p-value is about 3*10^-4.
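
A rough sketch of the simulation in note 2b, under stated assumptions: a continuous power law above a known cutoff `xmin`, simulated stand-in data instead of the real intervals, and far fewer replications than the hundred thousand used in the post. The function names are hypothetical. The key step is re-estimating the exponent on every simulated data set, so that the simulated distances include the same fitting step as the real one.

```python
import math
import random


def fit_alpha(xs, xmin):
    """MLE of the exponent for a continuous power law on x >= xmin."""
    return 1.0 + len(xs) / sum(math.log(x / xmin) for x in xs)


def ks_distance(xs, alpha, xmin):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(sorted(xs)):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def draw(n, alpha, xmin, rng):
    """Simulate n values from the power law by inverting its CDF."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def ks_p_value(xs, xmin, rng, reps=1000):
    """Share of simulated power-law samples that fit worse than the data do."""
    alpha_hat = fit_alpha(xs, xmin)
    observed = ks_distance(xs, alpha_hat, xmin)
    exceed = 0
    for _ in range(reps):
        fake = draw(len(xs), alpha_hat, xmin, rng)
        if ks_distance(fake, fit_alpha(fake, xmin), xmin) >= observed:
            exceed += 1
    return exceed / reps


# Stand-in data (assumptions for illustration): one real power law, one log-normal.
rng = random.Random(4)
xmin = 1.0
power_law_data = draw(200, 1.4, xmin, rng)
log_normal_data = []
while len(log_normal_data) < 200:
    x = math.exp(rng.gauss(2.0, 1.0))
    if x >= xmin:  # keep only values above the cutoff
        log_normal_data.append(x)

p_power = ks_p_value(power_law_data, xmin, rng)
p_lognormal = ks_p_value(log_normal_data, xmin, rng)
print("p-value, power-law data:", p_power)
print("p-value, log-normal data:", p_lognormal)
```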

3: There are in fact subtle, not to say profound, issues in the sociology and philosophy of science here: was Lombroso always a pseudo-scientist, because his investigations never came up to any acceptable standard of reliable inquiry? Or just because they didn't come up to the standards of inquiry prevalent at the time he wrote? Or did Lombroso become a pseudo-scientist, when enough members of enough intellectual communities woke up from the pleasure of having their prejudices about the lower orders echoed to realize that he was full of it? However that may be, this paper has the dubious privilege of being the first time I have ever seen Lombroso cited as an authority rather than a specimen.

4: Actually, for several years my bibliography data base had the wrong page numbers for one of my own papers, due to a typo, so their method would flag some of my subsequent works as written by someone who had cited that paper without reading it, which I assure you was not the case. But the idea seems reasonable in general.

Power Laws; Learned Folly

Posted by crshalizi at January 17, 2012 20:23 | permanent link

What's That Got to Do with the Price of Condos in California? (Advanced Data Analysis from an Elementary Point of View)

In which we practice the art of linear regression upon the California real-estate market, by way of warming up for harder tasks.

Assignment, data set

(Yes, the data set is now about as old as my students, but last week in Austin I was too busy drinking on 6th street and having lofty conversations about the future of statistics to update the file with the UScensus2000 package.)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 17, 2012 10:31 | permanent link

Regression: Predicting and Relating Quantitative Features (Advanced Data Analysis from an Elementary Point of View)

Statistics is the science which studies methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
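
As a small illustration of what "linear smoother" means here, the sketch below writes nearest-neighbor averaging as an explicit set of weights on the training responses: the prediction is a weighted sum of the observed y values, with weights that depend only on the x values. The toy data and function names are made up for illustration.

```python
def knn_weights(x0, xs, k):
    """Weights that k-nearest-neighbor averaging puts on each training response."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    weights = [0.0] * len(xs)
    for i in nearest:
        weights[i] = 1.0 / k
    return weights


def knn_predict(x0, xs, ys, k):
    """Prediction at x0: a weighted sum of ys, hence linear in the responses."""
    return sum(w * y for w, y in zip(knn_weights(x0, xs, k), ys))


# Toy data (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
print(knn_weights(1.2, xs, 3))      # weight 1/3 on the three closest points
print(knn_predict(1.2, xs, ys, 3))  # average of the responses at x = 0, 1, 2
print(knn_predict(1.2, xs, ys, 5))  # k = n recovers the overall mean
```

Kernel-weighted averaging has the same form, with weights that decay smoothly with distance instead of dropping from 1/k to zero.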

Readings: Notes, chapter 1; Faraway, chapter 1, through page 17.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 17, 2012 10:30 | permanent link

January 07, 2012

Mail Woes

If you sent me e-mail at my @stat.cmu.edu address in the last few days, I haven't gotten it, and may never get it. The address firstinitiallastname at cmu dot edu now points somewhere where I can read.

Posted by crshalizi at January 07, 2012 20:40 | permanent link

January 06, 2012

Sloth in Austin

I'll be speaking at UT-Austin next week, through the kindness of the division of statistics and scientific computation:

"When Can We Learn Network Models from Samples?"
Abstract: Statistical models of network structure are models for the entire network, but the data are typically just a sampled sub-network. Parameters for the whole network, which are what we care about, are estimated by fitting the model on the sub-network. This assumes that the model is "consistent under sampling" (forms a projective family). For the widely-used exponential random graph models (ERGMs), this trivial-looking condition is violated by many popular and scientifically appealing models; satisfying it drastically limits ERGMs' expressive power. These results are special cases of more general ones about exponential families of dependent variables, which we also prove. As a consolation prize, we offer easily checked conditions for the consistency of maximum likelihood estimation in ERGMs, and discuss some possible constructive responses.
Time and place: 2--3 pm on Wednesday, 11 January 2012, in Hogg Building (WCH), room 1.108

This will of course be based on my paper with Alessandro, but since I understand some non-statisticians may sneak in, I'll try to be more comprehensible and less technical.

Since this will be my first time in Austin (indeed my first time in Texas), and I have (for a wonder) absolutely no obligations on the 12th, suggestions on what I should see or do would be appreciated.

Self-Centered

Posted by crshalizi at January 06, 2012 14:15 | permanent link

January 03, 2012

Course Announcement: Advanced Data Analysis from an Elementary Point of View

It's that time again:

36-402, Advanced Data Analysis, Spring 2012
Description: This course introduces modern methods of data analysis, building on the theory and application of linear models from 36-401. Topics include nonlinear regression, nonparametric smoothing, density estimation, generalized linear and generalized additive models, simulation and predictive model-checking, cross-validation, bootstrap uncertainty estimation, multivariate methods including factor analysis and mixture models, and graphical models and causal inference. Students will analyze real-world data from a range of fields, coding small programs and writing reports.
Prerequisites: 36-401 (modern regression); or consent of instructor, in extraordinary cases
Time and place: 10:30--11:50 am, Tuesdays and Thursdays, in Porter Hall 100
Note: Graduate students in other departments wishing to take this course for credit need consent of the instructor, and should register for 36-608.

Fuller details on the class homepage, including a detailed (but subject to change) list of topics, and links to the compiled course notes. I'll post updates here to the notes for specific lectures and assignments, like last time.

This is the same course I taught last spring, only grown from sixty-odd students to (currently) ninety-three (from 12 different majors!). The smart thing for me to do would probably be to change nothing (I haven't gotten to re-teach a class since 2009), but I felt the urge to re-organize the material and squeeze in a few more topics.

The biggest change I am making is introducing some quality-control sampling. The course is too big for me to look over much of the students' work, and even then, that gives me little sense of whether the assignments are really probing what they know (much less helping them learn). So I will be randomly selecting six students every week, to come to my office and spend 10--15 minutes each explaining the assignment to me and answering live questions about it. Even allowing for students being randomly selected multiple times*, I hope this will give me a reasonable cross-section of how well the assignments are working, and how well the grading tracks that. But it's an experiment and we'll see how it goes.

* (exercise for the student): Find the probability distribution of the number of times any given student gets selected. Assume 93 students, with 6 students selected per week, and 14 weeks. (Also assume no one drops the class.) Find the distribution of the total number of distinct students who ever get selected.
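
One possible way to attack the exercise numerically, under assumptions the post does not spell out: each week's six students are drawn uniformly without replacement, and the weeks are independent of each other. Then a given student's count is binomial with 14 trials and success probability 6/93, and the number of distinct students can be tracked week by week, since the number of new faces in a week is hypergeometric given how many have already been seen.

```python
from math import comb

STUDENTS, PER_WEEK, WEEKS = 93, 6, 14


def times_selected_pmf(students=STUDENTS, per_week=PER_WEEK, weeks=WEEKS):
    """P(a given student is selected k times), k = 0..weeks (binomial)."""
    p = per_week / students
    return [comb(weeks, k) * p ** k * (1 - p) ** (weeks - k) for k in range(weeks + 1)]


def distinct_selected_pmf(students=STUDENTS, per_week=PER_WEEK, weeks=WEEKS):
    """P(exactly d distinct students are ever selected), d = 0..students."""
    pmf = [0.0] * (students + 1)
    pmf[0] = 1.0
    total = comb(students, per_week)
    for _ in range(weeks):
        nxt = [0.0] * (students + 1)
        for seen, prob in enumerate(pmf):
            if prob == 0.0:
                continue
            for new in range(per_week + 1):
                # Ways to pick `new` unseen students and the rest from those already seen.
                ways = comb(students - seen, new) * comb(seen, per_week - new)
                if ways:
                    nxt[seen + new] += prob * ways / total
        pmf = nxt
    return pmf


times_pmf = times_selected_pmf()
distinct_pmf = distinct_selected_pmf()
expected_distinct = sum(d * p for d, p in enumerate(distinct_pmf))
print("P(never selected):", round(times_pmf[0], 4))
print("expected number of distinct students:", round(expected_distinct, 2))
```

As a check, linearity of expectation says the expected number of distinct students should equal 93 * (1 - (87/93)^14), which the week-by-week calculation reproduces.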

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 03, 2012 23:00 | permanent link

January 01, 2012

End of Year Inventory, 2011

Attention conservation notice: Navel-gazing.

Paper manuscripts completed: 12
Papers accepted: 2 [i, ii], one from last year
Papers rejected: 10 (fools! I'll show you all!)
Papers rejected with a comment from the editor that no one should take the paper I was responding to, published in the same glossy high-impact journal, "literally": 1
Papers in refereeing limbo: 4
Papers in progress: I won't look in that directory and you can't make me

Grant proposals submitted: 3
Grant proposals rejected: 4 (two from last year)
Grant proposals in refereeing limbo: 1
Grant proposals in progress for next year: 3

Talks given and conferences attended: 20, in 14 cities

Manuscripts refereed: 46, for 18 different journals and conferences
Manuscripts waiting for me to referee: 7
Manuscripts for which I was the responsible associate editor at Annals of Applied Statistics: 10
Book proposals reviewed: 3

Classes taught: 2
New classes taught: 2
Summer school classes taught: 1
New summer school classes taught: 1
Pages of new course material written: about 350

Students who are now ABD: 1
Students who are not just ABD but on the job market: 1

Letters of recommendation written: 8 (with about 100 separate destinations)

Promotion packets submitted: 1 (for promotion to associate professor, but without tenure)
Promotion cases still working through the system: 1

Book reviews published on dead trees: 2 [i, ii]
Non-book-reviews published on dead trees: 1

Weblog posts: 157
Substantive weblog posts: 54, counting algal growths

Books acquired: 298
E-book readers gratefully received: 1
Books driven by my mother from her house to Pittsburgh: about 800
Books begun: 254
Books finished: 204 (of which 34 on said e-book reader)
Books given up: 16
Books sold: 133
Books donated: 113

Book manuscripts completed: 0

Wisdom teeth removed: 4
Unwise teeth removed: 1

Major life transitions: 0

Self-Centered

Posted by crshalizi at January 01, 2012 12:00 | permanent link

December 31, 2011

Books to Read While the Algae Grow in Your Fur, December 2011

Attention conservation notice: I have no taste.

Andrea Camilleri, The Wings of the Sphinx; The Track of Sand; The Potter's Field
Delightful as always, though tinged with melancholy, because Montalbano is growing old (and making some questionable personal decisions because of it). The Track of Sand is perhaps the least Dick Francis-like mystery involving horse-racing I have run across.
Peter Bühlmann and Sara van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications
(My mini-review has grown to a few thousand words, complete with figures, equations, and R, so I'll throttle down, and link to the review when I'm finished. In the meanwhile, a book report.)
This is a sound, thorough and reliable guide to what we currently know about linear (generalized linear, additive...) modeling in the high-dimensional regime where the number of adjustable parameters is much larger than the number of observations. The bulk of the book (chapters 2--9) is about the lasso (L1 penalization) and closely related methods. Chapters 2--5 and 9 are largely methodological; the theory comes in chapters 6--8, which are concerned with predictive accuracy, parametric consistency, and variable selection. These theoretical chapters make extensive use of empirical process techniques, which is not surprising considering that van de Geer wrote the book on empirical process theory in estimation. Chapter 14, really a kind of appendix, collects the necessary concepts and results from empirical process theory proper; it is formally self-contained, but probably some prior exposure would be helpful.
Chapters 10 and 11 turn to issues of stability and statistical significance in variable selection, closely following recent work by Bühlmann and collaborators. Chapter 12 is a very nice treatment of boosting, where one uses an ensemble of highly-biased and low-capacity, but very stable, models to compensate for each other's faults. Chapter 13, finally, turns to graphical models, especially Gaussian graphical models, looking at ways of inferring the graph based on the lasso principle, on local regression, and, even more closely, the PC algorithm of P. Spirtes and C. Glymour. (This chapter draws on work by Kalisch and Bühlmann on how the PC algorithm works in the high-dimensional regime.) Causal inference is an important application of graphical models, but it is, perhaps wisely, not discussed.
The core chapters (6--8) are much rougher going than the more method-oriented ones, but that's just the nature of the material. (Incidentally, the stark contrast between the tools and concepts used in this book and what one finds in, say, Casella and Berger is a good illustration of how theoretical statistics has been shaped by intuitions about low-dimensional problems which serve us poorly in the high-dimensional regime.) I know of no better, more up-to-date summary of current theoretical knowledge about high-dimensional regression, and how it connects to practical methods. It could be used as a textbook, but for very advanced students; it's really better suited to self-study. For that, however, I can recommend it highly to anyone with a serious interest in the area.
Disclaimer: both authors are the kind of person who might get asked to review my application for tenure.
Tim Groseclose, Left Turn: How Liberal Media Bias Distorts the American Mind
I will, for my sins, have much more to say about this soon.
Here I will just remark on one point which I had to leave out of the longer piece, for reasons of space. The whole analysis is based on models of decision-making by politicians and by media organizations, where they are supposed to get utility, in the strict sense, directly from citing advocacy organizations. Politicians, that is to say, do not shape their speeches with an eye to persuading other legislators, signaling their supporters among voters, signaling their supporters among funders, signaling potential voters or funders, threatening or bargaining with opponents --- nothing except the warm glow of ideological agreement matters to them. (There is such a thing as expressive action, and you can even model parts of it decision-theoretically, but this is not the way.) And yet this gets published in the Quarterly Journal of Economics, when run by those who think "people respond to incentives" is the law and the prophets. What this says about the intellectual and social organization of economics, and its colonies in other social sciences, I will leave to readers to decide.
(No purchase link because I think it's a truly bad book, though I dutifully bought my copy for the exercise.)
Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design
This has been getting a lot of good press on various R blogs, and deservedly so. It is a clear, sound, user-friendly, no-nonsense introduction to programming through R, pitched at someone who has never programmed before (though not too hand-holding for someone who has). Statistical content is largely confined to the most basic sorts of statistical functions and the detailed examples, of which there are many. Unusual and welcome features: the detailed treatment of factors and tables; the chapters on input/output and on string manipulation; the chapter on debugging. (I am not sure how I feel about the chapter on parallelism: it's an important topic, but it feels too specialized for a first book.)
Naturally, I had complaints. Some of these are the inevitable ones about how I wish there'd been more: about simulation; about formulas and automatically manipulating model-fitting routines; about the split/apply/combine pattern; about working with databases and reshaping data. Others are matters of emphasis: I think Matloff is overly accepting of global variables and global assignment, which in my experience with students just makes things much harder to debug, especially once they start working together. My biggest beef is that Matloff is so focused on the nuts and bolts that he says very little about design principles — that is, about the art of programming. He certainly understands those principles, he even hints at them in the chapter on debugging, but a student would be really lucky to induce them from the book.
Still, while this is not a perfect fit for my highly specific needs, I wish it had been available in time to assign this fall. I will certainly assign it the next time I teach that class --- unless a rival publisher offers a truly striking bribe, or something better comes out in the meanwhile.
(Another attraction of Matloff's book, as a textbook, is that it is so cheap. There is even a free PDF draft from September 2009; I haven't checked how much this differs from the published book.)
Madeleine E. Robins, The Sleeping Partner
Mind candy: very slightly alternate-history Regency England private-eye detection. It's a sequel to Point of Honour and Petty Treason. Please go out and buy all three, so that Robins will keep writing them.
Kage Baker, The Bird of the River
Baker's first two fantasy novels set in this world, The Anvil of the World and The House of the Stag, were funny, exciting, well-told. They also had an astonishing quality of contrivance, of every little detail locking together in a single intricate mechanism. Unless I have missed a lot (which is possible), this is merely a well-told fantasy novel which is also about various forms of growing up, and not Baker giving a bravura performance in the role of Providence. There may be a message in this. (Sadly, she died in 2010, far too soon, and there will not be any more of these.)
Matthew Restall and Amara Solari, 2012 and the End of the World: The Western Roots of the Maya Apocalypse
A brief yet thorough and comprehensive debunking of the idea that ancient Maya thought the world would end on 21 December 2012. Really, however, this is used as an excuse for introducing Maya civilization, the Western apocalyptic tradition, and how the latter was blended into the former after the Conquest. (They do not, sadly from my point of view, go very deeply into the history of modern 2012-ology.) Fast-paced, very clear, and far more polite to the peddlers of this brand of nonsense than they deserve.
Patrick O'Brian, Treason's Harbour, The Far Side of the World, The Reverse of the Medal
I read these too fast.

Books to Read While the Algae Grow in Your Fur; Pleasures of Detection, Portraits of Crime; The Commonwealth of Letters; Enigmas of Chance; Scientifiction and Fantastica Psychoceramica; Writing for Antiquity; Commit a Social Science; The Running-Dogs of Reaction

Posted by crshalizi at December 31, 2011 23:59 | permanent link

December 20, 2011

Self-Evaluation and Lessons Learned (Introduction to Statistical Computing)

Attention conservation notice: Academic statistico-algorithmic navel-gazing.

With the grading done, but grades not yet posted while we wait for the students to fill out faculty evaluations, it's time to reflect on the class just finished. (Since this is the third time I've done a post like this, I guess it's now one of my traditions.)

Overall, it went a lot better than my worst fears, especially considering this was the first time the class was offered. There was a lot of attrition initially, both from students who had taken a lot of programming, and from students who had done no programming at all. (I was truly surprised by how many students had never used a command line before.) The ones who stuck around all (I think) learned a lot --- more for those who knew less about programming to start with, naturally. Most of the credit for this goes to Vince.

Some stuff didn't work well:

  • The labs were too hard to finish in 50 minutes. (Every student who mentioned the labs in their feedback, and that was most of them, complained that they were too short, and that there were too few TAs.) Either the problems need to be made much easier, or we need much more lab time, or we need to ditch labs. (But it would be good to give them immediate feedback on programming...) I am not sure what the right thing to do is.
  • The in-class midterm. This did not probe the students' skills as well as I'd hoped, and the very low scores seem to have depressed morale. (It got curved, of course, so it didn't end up hurting anyone, but still.) Next time, either a take-home midterm, or eliminating the midterm altogether in favor of more weight, and time, on the project.
  • The final projects need more time, and more intermediate feedback for mid-course corrections.
  • Writing problem sets the weekend before they were assigned. (I don't think it will surprise any of The Kids to learn I was doing this, or that Vince was better organized.)
  • If the word "hate" was uttered each nanosecond of the hours I spent wrestling with Blackboard, it would not equal one one-billionth of the hate I feel for that software and its designers at this instant. Unfortunately I don't have a better solution which (i) lets students submit their work electronically, (ii) lets the graders share the work, and (iii) provides a shared gradebook.

Stuff that worked well:

  • Most of the homework assignments, despite the visible seams. Specifically, writing the assignments as (very nearly) a series of tests seems to have helped, and should be pushed further. (I got this idea from Bill Tozier, though he may not recall it.)
  • Teaching testing and top-down design. (Grading should enforce this more in the future.)
  • The data-wrangling topics were a big hit. (Again, all to Vince's credit.)
  • Giving a group project final instead of an in-class or even a take-home exam. (Results speak for themselves.)

Stuff I'd try to do next time:

  • Provide more hints about looking stuff up in The R Cookbook.
  • Require the use of a sensible text editor from the beginning (and maybe have the first lab be mostly about that, plus introducing the command line). RStudio would probably work, though to be effective I'd have to switch to it myself, and away from R.app + Emacs.
  • Enforce style, naming and commenting conventions even more rigidly than now (especially commenting).
  • Schedule project presentations during the final exam period, so they don't eat into lecture slots. Between that, and not having to kill lecture slots for the midterm exam, the pre-exam review and the post-exam inquest, it should be possible to add back in optimization, more about simulation/Monte Carlo, and even more data manipulation.
  • Clarify expectations at the beginning: students will have to use statistics they already know from the pre-req classes, and to learn new statistics. (It's not as though there aren't plenty of statistics-free programming classes for them to take...)

Overall assessment: B; promising, but with clear areas for definite improvement.

Obligatory disclaimer: Don't blame Vince, or anyone else, for what I say here.

Introduction to Statistical Computing

Posted by crshalizi at December 20, 2011 09:35 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems