November 24, 2009

(I Don't Care About My) Bad Reputation

Attention conservation notice: Unskillful nattering about pop-culture ephemera.
For the sake of my own sanity, I prefer to remain ignorant of the occult processes by which the direct mail gods decide to which catalogues to send to which people. (There's too much dynamic programming involved.) Today, for instance, they decided to inflict upon me the official Barbie doll spring 2010 collection preview, and like a fool I couldn't resist looking through it. Thus my life is made that much worse by learning that there is a Joan Jett Barbie doll. (I thought about embedding an image, but in this case pain shared is not pain eased.) I think I finally grasp what people mean when they talk about later cultural products assaulting parts of their childhood, in this case one I didn't even realize I valued.

Linkage

Posted by crshalizi at November 24, 2009 19:27 | permanent link

November 21, 2009

"Homophily, Contagion, Confounding: Pick Any Three"

A number of people have asked for my slides from the MERSIH conference the other week. So, here they are. (Anyone who was at my talk at SFI about a year ago will recognize the title, and much of the content.) I'm presently turning this into a proper manuscript, so comments are welcome. Please don't rip it off; I'll become very cross and may even hold my breath until I turn blue and pass out, and won't you be sorry then?

Networks; Enigmas of Chance; Complexity; Self-Centered

Posted by crshalizi at November 21, 2009 18:32 | permanent link

November 19, 2009

"Statistical Analysis of Stellar Evolution" (Next Week at the Statistics Seminar)

In which the starry heavens above submit to statistical analysis:

David van Dyk, "Statistical Analysis of Stellar Evolution"
Abstract: Color-Magnitude Diagrams (CMDs) are plots that compare the magnitudes (luminosities) of stars in different wavelengths of light (colors). High non-linear correlations among the mass, color and surface temperature of newly formed stars induce a long narrow curved point cloud in a CMD known as the main sequence. Aging stars form new CMD groups of red giants and white dwarfs. The physical processes that govern this evolution can be described with mathematical models and explored using complex computer models. These calculations are designed to predict the plotted magnitudes as a function of parameters of scientific interest such as stellar age, mass, and metallicity. Here, we describe how we use the computer models as a component of a complex likelihood function in a Bayesian analysis that requires sophisticated computing, corrects for contamination of the data by field stars, accounts for complications caused by unresolved binary-star systems, and aims to compare competing physics-based computer models of stellar evolution.
This is joint work with Steven DeGennaro, Nathan Stein, William H. Jefferys, Ted von Hippel, and Elizabeth Jeffery.
Place and time: Doherty Hall A310, Monday, 23 November, 4--5 pm.

Enigmas of Chance; The Eternal Silence of These Infinite Spaces; Physics

Posted by crshalizi at November 19, 2009 12:02 | permanent link

November 13, 2009

"Some Things Statisticians Do at Google" (Next Week at the Statistics Seminar)

Attention conservation notice: Of no use to you unless (1) you want to know what statisticians do at search-engine companies and (2) you are in Pittsburgh.
Mike Meyer, "Some Things Statisticians Do at Google"
Abstract: I'll talk about a number of projects at Google where statisticians have made a large contribution. There will not be a lot of technical details. In some cases I will just describe the problem.
The major example will be a description of the statistical and engineering infrastructure to support live traffic experiments at Google.
A common theme of the problems is the importance of understanding basic statistical principles that can be applied and modified to handle new data and new circumstances.
Place and time: Monday, 16 November at 4 pm, in Doherty Hall A310

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at November 13, 2009 15:09 | permanent link

November 08, 2009

The Shadow Price of Power

Attention conservation notice: Quasi-teaching note giving an economic interpretation of the Neyman-Pearson lemma on statistical hypothesis testing.

Suppose we want to pick out some sort of signal from a background of noise. As every schoolchild knows, any procedure for doing this, or test, divides the data space into two parts, the one where it says "noise" and the one where it says "signal".* Tests will make two kinds of mistakes: they can can take noise to be signal, a false alarm, or can ignore a genuine signal as noise, a miss. Both the signal and the noise are stochastic, or we can treat them as such anyway. (Any determinism distinguishable from chance is just insufficiently complicated.) We want tests where the probabilities of both types of errors are small. The probability of a false alarm is called the size of the test; it is the measure of the "say 'signal'" region under the noise distribution. The probability of a miss, as opposed to a false alarm, has no short name in the jargon, but one minus the probability of a miss — the probability of detecting a signal when it's present — is called power.

Suppose we know the probability density of the noise p and that of the signal is q. The Neyman-Pearson lemma, as many though not all schoolchildren know, says that then, among all tests off a given size s, the one with the smallest miss probability, or highest power, has the form "say 'signal' if q(x)/p(x) > t(s), otherwise say 'noise'," and that the threshold t varies inversely with s. The quantity q(x)/p(x) is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if its sufficiently more likely than noise.

The likelihood ratio indicates how different the two distributions — the two hypotheses — are at x, the data-point we observed. It makes sense that the outcome of the hypothesis test should depend on this sort of discrepancy between the hypotheses. But why the ratio, rather than, say, the difference q(x) - p(x), or a signed squared difference, etc.? Can we make this intuitive?

Start with the fact that we have an optimization problem under a constraint. Call the region where we proclaim "signal" R. We want to maximize its probability when we are seeing a signal, Q(R), while constraining the false-alarm probability, P(R) = s. Lagrange tells us that the way to do this is to minimize Q(R) - t[P(R) - s] over R and t jointly. So far the usual story; the next turn is usually "as you remember from the calculus of variations..."

Rather than actually doing math, let's think like economists. Picking the set R gives us a certain benefit, in the form of the power Q(R), and a cost, tP(R). (The ts term is the same for all R.) Economists, of course, tell us to equate marginal costs and benefits. What is the marginal benefit of expanding R to include a small neighborhood around the point x? Just, by the definition of "probability density", q(x). The marginal cost is likewise tp(x). We should include x in R if q(x) > tp(x), or q(x)/p(x) > t. The boundary of R is where marginal benefit equals marginal cost, and that is why we need the likelihood ratio and not the likelihood difference, or anything else. (Except for a monotone transformation of the ratio, e.g. the log ratio.) The likelihood ratio threshold t is, in fact, the shadow price of statistical power.

I am pretty sure I have not seen or heard the Neyman-Pearson lemma explained marginally before, but in retrospect it seems too simple to be new, so pointers would be appreciated.

Manual trackback: John Barrdear

Updates: Thanks to David Kane for spotting a typo.

*: Yes, you could have a randomized test procedure, but the situations where those actually help pretty much define "boring, merely-technical complications."

Enigmas of Chance

Posted by crshalizi at November 08, 2009 03:06 | permanent link

November 05, 2009

36-350, Data Mining: Course Materials (Fall 2009)

My lesson-plan having survived first contact with the enemy students, it's time to start posting the lecture handouts & c. This page will be updated as the semester goes on; the RSS feed for it should be here. The class homepage has more information.

  1. Introduction to the course (24 August) What is data mining? how is it used? where did it come from? Some themes.
  2. Information retrieval and similarity searching I (26 August) Finding the data you are looking for. Ideas we will avoid: meta-data and cataloging; meanings. Textual features. The bag-of-words representation; its vector form. Measuring similarity and distance for vectors. Example with the New York Times Annotated Corpus.
  3. IR continued (28 August). The trick to searching: queries are documents. Search evaluation: precision, recall, precision-recall curves; error rates. Classification: nearest neighbors and prototypes; classifier evaluation by mis-classification rate and by confusion matrices. Inverse document frequency weighting. Visualizing high-dimensional data by multi-dimensional scaling. Miscellaneous topics: stemming, incorporating user feedback.

    Homework 1, due 4 September: assignment, R, data; SOLUTIONS

  4. Page Rank (31 August). Links as pre-existing feedback. How to exploit link information? The random walk on the graph; using the ergodic theorem. Eigenvector formulation of page-rank. Combining page-rank with textual features. Other applications. Further reading on information retrieval.
  5. Image Search, Abstraction and Invariance (2 September). Similarity search for images. Back to representation design. The advantages of abstraction: simplification, recycling. The bag-of-colors representation. Examples. Invariants. Searching for images by searching text. An example in practice. Slides for this lecture.
  6. Information Theory I (4 September). Good features help us guess what we can't represent. Good features discriminate between different values of unobserved variables. Quantifying uncertainty with entropy. Quantifying reduction in uncertainty/ discrimination with mutual information. Ranking features based on mutual information. Examples, with code, of informative words for the Times. Code.
    Supplementary reading: David P. Feldman, Brief Tutorial on Information Theory, chapter 1

    Homework 2, due 11 September: assignment; SOLUTIONS TEXT; SOLUTIONS R

  7. Information Theory II (9 September). Dealing with multiple features. Joint entropy, the chain rule for entropy. Information in multiple features. Conditional information, chain rule for information, conditional independence. Interactions, positive and negative, and redundancy. Greedy feature selection with low redundancy. Example, with code, of selecting words for the Times. Sufficient statistics and the information bottleneck. Code.
    Supplementary reading; Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", arxiv:cs.AI/0308002
  8. Categorization; Clustering I (11 September). Dividing the world up into categories. Classification: known categories with labeled examples. Taxonomy of learning problems (supervised, unsupervised, semi-supervised, feedback, ...). Clustering: discovering unknown categories from unlabeled data. Benefits of clustering, with an digression on where official classes come from. Basic criterion for good clusters: lots of information about features from little information about cluster. Practical considerations: compactness, separation, parsimony, balance. Doubts about parsimony and balance. The k-means clustering algorithm, or unlabeled prototype classification: analysis, geometry, search. Appendix: geometric aspects of the prototype and nearest-neighbor method.

    Homework 3, due 18 September: assignment; SOLUTIONS

  9. Clustering II (14 September). Distances between partitions; variation-of-information distance. Hierarchical clustering by agglomeration and its varieties. Picking the number of clusters by merging costs. Performance of different clustering methods on various doodles. Why we would like to pick the number of clusters by predictive performance, and why it is hard to do at this stage. Reifying clusters.
  10. Transformations: Rescaling and Low-Dimensional Summaries (16 September). Improving on our original features. Re-scaling, standardization, taking logs, etc., of individual features. Forcing things to be Gaussian considered harmful. Low-dimensional summaries by combining features. Exploiting geometry to eliminate redundancy. Projections on to linear subspaces. Searching for structure-preserving projections.
  11. Principal Components I (18 September). Principal components are the directions of maximum variance. Derivation of principal components as the best approximation to the data in a linear subspace. Equivalence to variance maximization. Avoiding explicit optimization by finding eigenvalues and eigenvectors of the covariance matrix. Example of principal components with cars; how to tell a sports car from a minivan. The standard recipe for doing PCA. Cautions in interpreting PCA. Data-set used in the notes.

    Homework 4, due 25 September: assignment; SOLUTIONS

  12. Principal Components II (21 September). PCA + information retrieval = latent semantic indexing; why LSI is a Good Idea. PCA and multidimensional scaling.
  13. Factor Analysis (23 and 25 September). From PCA to factor analysis by adding noise. Roots of factor analysis in causal discovery: Spearman's general factor model and the tetrad equations. Problems with estimating factor models: number of equations does not equal number of unknowns. Solution 1, "principal factors", a.k.a. estimation through heroic feats of linear algebra. Solution 2, maximum likelihood, a.k.a. estimation through imposing distributional assumptions. The rotation problem: the factor model is unidentifiable; the number of factors may be meaningful, but the individual factors are not.
  14. The Truth about PCA and Factor Analysis (28 September) PCA is data reduction without any probabilistic assumptions about where the data came from. Picking number of components. Faking predictions from PCA. Factor analysis makes stronger, probabilistic assumptions, and delivers stronger, predictive conclusions --- which could be wrong. Using probabilistic assumptions and/or predictions to pick how many factors. Factor analysis as a first, toy instances of a graphical causal model. The rotation problem once more with feeling. Factor models and mixture models. Factor models and Thomson's sampling model: an outstanding fit to a model with a few factors is actually evidence of a huge number of badly measured latent variables. Final advice: it all depends, but if you can only do one, try PCA. R code for the Thomson sampling model.
  15. Nonlinear Dimensionality Reduction I: Locally Linear Embedding (5 October). Failure of PCA and all other linear methods for nonlinear structures in data; spirals, for example. Approximate success of linear methods on small parts of nonlinear structures. Manifolds: smoothly curved surfaces embedded in higher-dimensional Euclidean spaces. Every manifold looks like a linear subspace on a sufficiently small scale, so we should be able to patch together many small local linear approximations into a global manifold. Local linear embedding: approximate each vector in the data as a weighted linear combination of its k nearest neighbors, then find the low-dimensional vectors best reconstructed by these weights. Solving the optimization problems by linear algebra. Coding up LLE. A spiral rainbow. R.
  16. Nonlinear Dimensionality Reduction II: Diffusion Maps (9 October). Making a graph from the data; random walks on this graph. The diffusion operator, a.k.a. Laplacian. How the Laplacian encodes the shape of the data. Eigenvectors of the Laplacian as coordinates. Connection to page-rank. Advantages when data are not actually on a manifold. Example.

    Pre-midterm review (12 October): highlights of the course to date; no handout.
    MIDTERM (14 October): exam, solutions

    Homework 5, due 23 October: assignment; solutions

  17. Regression I: Basics. Guessing a real-valued random variable; why expectation values are mean-square optimal point forecasts. The regression function; why its estimation must involve assumptions beyond the data. The bias-variance decomposition and the bias-variance trade-off. First example of improving prediction by introducing variance. Ordinary least squares linear regression as smoothing. Other linear smoothers: k-nearest-neighbors and kernel regression. How much should we smooth? R, data for running example
  18. Regression II: The Truth About Linear Regression (21 October). Linear regression is optimal linear (mean-square) prediction; we do this because we hope a linear approximation will work well enough over a small range. What linear regression does: decorrelate the input features, then correlate them separately with the response and add up. The extreme weakness of the probabilistic assumptions needed for this to make sense. Difficulties of linear regression; collinearity, errors in variables, shifting distributions of inputs, omitted variables. The usual extra probabilistic assumptions and their implications. Why you should always looking at residuals. Why you generally shouldn't use regression for causal inference. How to torment angels. Likelihood-ratio tests for restrictions of nice models.
  19. Regression III: Extending Linear Regression (23 October). Weighted least squares. Heteroskedasticity: variance is not the same everywhere. Going to consult the oracle. Weighted least squares as a solution to heteroskedasticity. Nonparametric estimation of the variance function. Local polynomial regression: local constants (= kernel regression), local linear regression, higher-order local polynomials. Lowess = locally-linear smoothing for scatter plots. The oracles fall silent.

    Homework 6, due Friday, 30 October: assignment, data set; solutions

  20. Evaluating Predictive Models (26 and 28 October). In-sample, out-of-sample and generalization loss or error; risk as expected loss on new data. Under-fitting, over-fitting, and examples with polynomials. Methods of model selection and controlling over-fitting: empirical risk minimization, penalization, constraints/sieves, formal learning theory, cross-validation. Limits of generalization. R for creating figures.
  21. Smoothing Methods in Regression (30 October). How much smoothing should we do? Approximation by local averaging. How much smoothing we should do to find the unknown curve depends on how smooth the curve really is, which is unknown. Adaptation as a partial substitute for actual knowledge. Cross-validation for adapting to unknown smoothness. Application: testing parametric regression models by comparing them to nonparametric fits. The bootstrap principle. Why ever bother with parametric regressions? R code for some of the examples.

    Homework 7, due Friday, 6 November: assignment

  22. Additive Models (2 November). A nice feature of linear models: partial responses, partial residuals, and backfitting estimations. Additive models: regression curve is a sum of partial response functions; partial residuals and the backfitting trick generalize. Parametric and non-parametric rates of convergence. The curse of dimensionality for unstructured nonparametric models. Additive models as a compromise, introducing bias to reduce variance. Example with the data from homework 6.
  23. Classification and Regression Trees (4 and 6 November). Prediction trees. A classification tree we can believe in. Prediction trees combine simple local models with recursive partitioning; adaptive nearest neighbors. Regression trees: example; a little math; pruning by cross-validation; more R mechanics. Classification trees: basics; measuring error by mis-classification; weighted errors; likelihood; Neyman-Pearson classifiers. Uncertainty for trees.

Corrupting the Young; Enigmas of Chance

Posted by crshalizi at November 05, 2009 22:45 | permanent link

November 04, 2009

Blosxom Fading in November

My old Blosxom installation (v. 2.0.2), after several years of working nicely, is growing increasingly cranky, and mulishly refusing to generate or update posts as the whim takes it. (I am not sure how much kicking and shoving it will need to produce this.) I'd appreciate a pointer to something which works similarly, but does work: I write posts in plain HTML in Emacs and drop them in a directory; it makes them look nice. If it handles tags and/or LaTeX nicely, so much the better.

Self-Centered

Posted by crshalizi at November 04, 2009 19:34 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems