October 28, 2009

36-350, Data Mining: Course Materials (Fall 2009)

My lesson-plan having survived first contact with the enemy students, it's time to start posting the lecture handouts & c. This page will be updated as the semester goes on; the RSS feed for it should be here. The class homepage has more information.

  1. Introduction to the course (24 August) What is data mining? how is it used? where did it come from? Some themes.
  2. Information retrieval and similarity searching I (26 August) Finding the data you are looking for. Ideas we will avoid: meta-data and cataloging; meanings. Textual features. The bag-of-words representation; its vector form. Measuring similarity and distance for vectors. Example with the New York Times Annotated Corpus.
  3. IR continued (28 August). The trick to searching: queries are documents. Search evaluation: precision, recall, precision-recall curves; error rates. Classification: nearest neighbors and prototypes; classifier evaluation by mis-classification rate and by confusion matrices. Inverse document frequency weighting. Visualizing high-dimensional data by multi-dimensional scaling. Miscellaneous topics: stemming, incorporating user feedback.

    Homework 1, due 4 September: assignment, R, data; SOLUTIONS

  4. Page Rank (31 August). Links as pre-existing feedback. How to exploit link information? The random walk on the graph; using the ergodic theorem. Eigenvector formulation of page-rank. Combining page-rank with textual features. Other applications. Further reading on information retrieval.
  5. Image Search, Abstraction and Invariance (2 September). Similarity search for images. Back to representation design. The advantages of abstraction: simplification, recycling. The bag-of-colors representation. Examples. Invariants. Searching for images by searching text. An example in practice. Slides for this lecture.
  6. Information Theory I (4 September). Good features help us guess what we can't represent. Good features discriminate between different values of unobserved variables. Quantifying uncertainty with entropy. Quantifying reduction in uncertainty/ discrimination with mutual information. Ranking features based on mutual information. Examples, with code, of informative words for the Times. Code.
    Supplementary reading: David P. Feldman, Brief Tutorial on Information Theory, chapter 1

    Homework 2, due 11 September: assignment; SOLUTIONS TEXT; SOLUTIONS R

  7. Information Theory II (9 September). Dealing with multiple features. Joint entropy, the chain rule for entropy. Information in multiple features. Conditional information, chain rule for information, conditional independence. Interactions, positive and negative, and redundancy. Greedy feature selection with low redundancy. Example, with code, of selecting words for the Times. Sufficient statistics and the information bottleneck. Code.
    Supplementary reading; Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", arxiv:cs.AI/0308002
  8. Categorization; Clustering I (11 September). Dividing the world up into categories. Classification: known categories with labeled examples. Taxonomy of learning problems (supervised, unsupervised, semi-supervised, feedback, ...). Clustering: discovering unknown categories from unlabeled data. Benefits of clustering, with an digression on where official classes come from. Basic criterion for good clusters: lots of information about features from little information about cluster. Practical considerations: compactness, separation, parsimony, balance. Doubts about parsimony and balance. The k-means clustering algorithm, or unlabeled prototype classification: analysis, geometry, search. Appendix: geometric aspects of the prototype and nearest-neighbor method.

    Homework 3, due 18 September: assignment; SOLUTIONS

  9. Clustering II (14 September). Distances between partitions; variation-of-information distance. Hierarchical clustering by agglomeration and its varieties. Picking the number of clusters by merging costs. Performance of different clustering methods on various doodles. Why we would like to pick the number of clusters by predictive performance, and why it is hard to do at this stage. Reifying clusters.
  10. Transformations: Rescaling and Low-Dimensional Summaries (16 September). Improving on our original features. Re-scaling, standardization, taking logs, etc., of individual features. Forcing things to be Gaussian considered harmful. Low-dimensional summaries by combining features. Exploiting geometry to eliminate redundancy. Projections on to linear subspaces. Searching for structure-preserving projections.
  11. Principal Components I (18 September). Principal components are the directions of maximum variance. Derivation of principal components as the best approximation to the data in a linear subspace. Equivalence to variance maximization. Avoiding explicit optimization by finding eigenvalues and eigenvectors of the covariance matrix. Example of principal components with cars; how to tell a sports car from a minivan. The standard recipe for doing PCA. Cautions in interpreting PCA. Data-set used in the notes.

    Homework 4, due 25 September: assignment; SOLUTIONS

  12. Principal Components II (21 September). PCA + information retrieval = latent semantic indexing; why LSI is a Good Idea. PCA and multidimensional scaling.
  13. Factor Analysis (23 and 25 September). From PCA to factor analysis by adding noise. Roots of factor analysis in causal discovery: Spearman's general factor model and the tetrad equations. Problems with estimating factor models: number of equations does not equal number of unknowns.
  14. Factor Analysis continued (25 September). Solution 1, "principal factors", a.k.a. estimation through heroic feats of linear algebra. Solution 2, maximum likelihood, a.k.a. estimation through imposing distributional assumptions. The rotation problem: the factor model is unidentifiable; the number of factors may be meaningful, but the individual factors are not.
  15. The Truth about PCA and Factor Analysis (28 September) PCA is data reduction without any probabilistic assumptions about where the data came from. Picking number of components. Faking predictions from PCA. Factor analysis makes stronger, probabilistic assumptions, and delivers stronger, predictive conclusions --- which could be wrong. Using probabilistic assumptions and/or predictions to pick how many factors. Factor analysis as a first, toy instances of a graphical causal model. The rotation problem once more with feeling. Factor models and mixture models. Factor models and Thomson's sampling model: an outstanding fit to a model with a few factors is actually evidence of a huge number of badly measured latent variables. Final advice: it all depends, but if you can only do one, try PCA. R code for the Thomson sampling model.
  16. Nonlinear Dimensionality Reduction I: Locally Linear Embedding (5 October). Failure of PCA and all other linear methods for nonlinear structures in data; spirals, for example. Approximate success of linear methods on small parts of nonlinear structures. Manifolds: smoothly curved surfaces embedded in higher-dimensional Euclidean spaces. Every manifold looks like a linear subspace on a sufficiently small scale, so we should be able to patch together many small local linear approximations into a global manifold. Local linear embedding: approximate each vector in the data as a weighted linear combination of its k nearest neighbors, then find the low-dimensional vectors best reconstructed by these weights. Solving the optimization problems by linear algebra. Coding up LLE. A spiral rainbow. R.
  17. Nonlinear Dimensionality Reduction II: Diffusion Maps (9 October). Making a graph from the data; random walks on this graph. The diffusion operator, a.k.a. Laplacian. How the Laplacian encodes the shape of the data. Eigenvectors of the Laplacian as coordinates. Connection to page-rank. Advantages when data are not actually on a manifold. Example.

    Pre-midterm review (12 October): highlights of the course to date; no handout.
    MIDTERM (14 October): exam, solutions

    Homework 5, due 23 October: assignment; solutions

  18. Regression I: Basics. Guessing a real-valued random variable; why expectation values are mean-square optimal point forecasts. The regression function; why its estimation must involve assumptions beyond the data. The bias-variance decomposition and the bias-variance trade-off. First example of improving prediction by introducing variance. Ordinary least squares linear regression as smoothing. Other linear smoothers: k-nearest-neighbors and kernel regression. How much should we smooth? R, data for running example
  19. Regression II: The Truth About Linear Regression (21 October). Linear regression is optimal linear (mean-square) prediction; we do this because we hope a linear approximation will work well enough over a small range. What linear regression does: decorrelate the input features, then correlate them separately with the response and add up. The extreme weakness of the probabilistic assumptions needed for this to make sense. Difficulties of linear regression; collinearity, errors in variables, shifting distributions of inputs, omitted variables. The usual extra probabilistic assumptions and their implications. Why you should always looking at residuals. Why you generally shouldn't use regression for causal inference. How to torment angels. Likelihood-ratio tests for restrictions of nice models.
  20. Regression III: Extending Linear Regression (23 October). Weighted least squares. Heteroskedasticity: variance is not the same everywhere. Going to consult the oracle. Weighted least squares as a solution to heteroskedasticity. Nonparametric estimation of the variance function. Local polynomial regression: local constants (= kernel regression), local linear regression, higher-order local polynomials. Lowess = locally-linear smoothing for scatter plots. The oracles fall silent.

    Homework 6, due Friday, 30 October: assignment, data set

  21. Evaluating Predictive Models (26 and 28 October). In-sample, out-of-sample and generalization loss or error; risk as expected loss on new data. Under-fitting, over-fitting, and examples with polynomials. Methods of model selection and controlling over-fitting: empirical risk minimization, penalization, constraints/sieves, formal learning theory, cross-validation. Limits of generalization. R for creating figures.

Corrupting the Young; Enigmas of Chance

Posted by crshalizi at October 28, 2009 00:25 | permanent link

October 13, 2009

The Professions Considered as Pitchers of Icy Refreshing Lemonade

Attention conservation notice: Idle economic musings of a non-economist. Sparked by recent developments, but if you're interested in that you'd be better off elsewhere.

The usual libertarian story about professional licensing requirements — e.g., requiring someone who wants to practice medicine to go to medical school and pass exams, on pain of fines or jail — is that these are simply professionals conspiring in restraint of trade. Licensing simply erects a barrier to entry into the market for medical services, restricting supply and driving up price. Eliminate it, they say, and supply will expand and prices fall.

This presumes, however, that the demand for unlicensed professionals will be equal to the demand for licensed ones. It seems to me very easy to tell a "market for lemons" story here: someone in the market for professional services generally knows very little about how skilled various potential providers actually are. The sellers, however, generally know a lot about their own skill level, or at least more than the potential clients do. (There are no doubt exceptions, such as sincere quacks and the Dunning-Kreuger effect, but I don't think matters for the story.) This is the classic asymmetric information problem from Akerlof, with the usual result: the skilled providers demand more, but the clients have no way of telling them from the unskilled ones, so the only equilibrium is for only unskilled providers to be on the market and for trade to be depressed, or indeed absent. By putting a floor on the incompetence of professionals, licensing requirements stop the unraveling of the market and increase demand. They get us out of the market for lemons.

This occurred to me the other day, but it's obvious enough that I'm sure someone wrote it up long ago; where? (And did I read it and forget about it?)

(After-notes: 1. Of course, having told the story I have no idea if it's true of actual markets for professional services; learning that would require rather delicate empirical investigations. Checking the restraint-of-trade fable from Milton Friedman would, naturally, require those same investigations. 2. This doesn't rationalize why professions should be so largely self-governing, nor does it rule out the idea that some licensing requirements are counter-productive barriers to entry. 3. Replacing professional certification with some sort of market-based entity telling consumers about the quality of professional service-sellers won't work, for all the usual reasons that competitive markets are incapable of adequately providing information — to say nothing of the difficulty of telling whether the raters know what they're talking about. 4. Universities are accredited because students and parents would otherwise be in a market for lemons. Universities themselves, however, can tell how skilled those selling academic services are — or at least they're supposed to have that ability. 5. I should re-read Phil Agre on the professionalization of everything and see if it holds up.)

The Dismal Science

Posted by crshalizi at October 13, 2009 22:26 | permanent link

October 09, 2009

Twilight of the Market Gods

My review of Justin Fox's Myth of the Rational Market in American Scientist is out. (Shorter me: read the book.) Sometime soon I'll put up a version with links, which alas don't work in print.

Manual trackback: 3 Quarks Daily

The Dismal Science

Posted by crshalizi at October 09, 2009 00:54 | permanent link

October 08, 2009

Wit and Wisdom of Pittsburgh Bar Patrons (Part 1)

"They [= the Steelers] are like this utterly adorable, totally hot girl next door, who you suddenly realize is everything you've ever wanted in a football team — I mean, girlfriend."

Heard About Pittsburgh PA

Posted by crshalizi at October 08, 2009 23:00 | permanent link

"Completely Random Measures for Bayesian Nonparametrics" (This Year at the DeGroot Lecture)

Attention conservation notice: Only of interest if you (1) care about specifying probability distributions on infinite-dimensional spaces for use in nonparametric Bayesian inference, and (2) are in Pittsburgh.

The CMU statistics department sponsors an annual distinguished lecture series in memory of our sainted founder, Morris H. DeGroot. This year, the lecturer is Michael Jordan. (I realize that's a common name; I mean the one my peers and I wanted to be when we grew up.)

"Completely Random Measures for Bayesian Nonparametrics"
Abstract: Bayesian nonparametric modeling and inference are based on using general stochastic processes as prior distributions. Despite the great generality of this definition, the great majority of the work in Bayesian nonparametrics is based on only two stochastic processes: the Gaussian process and the Dirichlet process. Motivated by the needs of applications, I present a broader approach to Bayesian nonparametrics in which priors are obtained from a class of stochastic processes known as "completely random measures" (Kingman, 1967). In particular I will present models based on the beta process and the Bernoulli process, and will discuss an application of these models to the analysis of motion capture data in computational vision.
(Joint work with Emily Fox, Erik Sudderth and Romain Thibaux.)
Time and place: 4:15 pm on Friday, 16 October 2009, in the Giant Eagle Auditorium in Baker Hall (room A51)

Update: I counted over 210 people in the audience.

Enigmas of Chance

Posted by crshalizi at October 08, 2009 15:02 | permanent link

"High Dimensional Nonlinear Learning using Local Coordinate Coding" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about statistical learning in high-dimensional spaces and (2) are in Pittsburgh.

Since manifold learning has been on my mind this week, owing to trying to teach it in data-mining, I am extra pleased by the scheduling of this talk:

"High Dimensional Nonlinear Learning using Local Coordinate Coding"
Prof. Tong Zhang, Rutgers University
Abstract: We present a new method for learning nonlinear functions in high dimension using semisupervised learning. Our method includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point on a high dimensional manifold can be locally approximated by a linear combination of its nearby anchor points, with the linear weights offering its local-coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. The empirical success of our approach has been demonstrated in a recent pascal image classification competition, where the top performance was achieved by an NEC system using this idea.
(Joint work with Kai Yu at NEC Lab America.)
Time and place: 4 pm on Monday, 12 October 2009, in Doherty Hall 310

As always, the seminar is free and open to the public.

Enigmas of Chance

Posted by crshalizi at October 08, 2009 15:01 | permanent link

October 05, 2009

In re John Holland

Having vowed two weeks ago to post something positive at least once a week, I missed last week, with the excuse of being back in Ann Arbor for the celebration of John Holland's 80th birthday at the Center for the Study of Complex Systems. There was no time to post, or even to see everyone I wanted to, but I did actually start writing something about Holland's scientific work, only to realize yesterday I was merely engaged in self-plagiarism, from this, this and this, and probably other things I'd written too, because reading Holland has quite profoundly shaped my thinking. So I'll just point you to the back-catalogue, as it were, and get back to revising a paper I'd never have written if I hadn't read Adaptation in Natural and Artificial Systems.

(So long as I'm talking about the workshop, and without any slight to the other presentations, the neatest work was that by Stephanie Forrest et al. on using genetic programming to evolve bug fixes.)

Complexity; Minds, Brains, and Neurons; Enigmas of Chance

Posted by crshalizi at October 05, 2009 14:30 | permanent link

October 02, 2009

"Analyzing Networks and Learning with Graphs"

See you in Whistler?

Analyzing Networks and Learning with Graphs


a workshop in conjunction with

23nd Annual Conference on Neural Information Processing Systems (NIPS 2009)


December 11 or 12, 2009 (exact date TBD) Whistler, BC, Canada

Deadline for Submissions: Friday, October 30, 2009
Notification of Decision: Monday, November 9, 2009

Overview:

Recent research in machine learning and statistics has seen the proliferation of computational methods for analyzing networks and learning with graphs. These methods support progress in many application areas, including the social sciences, biology, medicine, neuroscience, physics, finance, and economics.

The primary goal of the workshop is to actively promote a concerted effort to address statistical, methodological and computational issues that arise when modeling and analyzing large collection of data that are largely represented as static and/or dynamic graphs. To this end, we aim at bringing together researchers from applied disciplines such as sociology, economics, medicine and biology, together with researchers from more theoretical disciplines such as mathematics and physics, within our community of statisticians and computer scientists. Different communities use diverse ideas and mathematical tools; our goal is to to foster cross-disciplinary collaborations and intellectual exchange.

Presentations will include novel graph models, the application of established models to new domains, theoretical and computational issues, limitations of current graph methods and directions for future research.

Online Submissions

We welcome the following types of papers:
  1. Research papers that introduce new models or apply established models to novel domains,
  2. Research papers that explore theoretical and computational issues, or
  3. Position papers that discuss shortcomings and desiderata of current approaches, or propose new directions for future research.
All submissions will be peer-reviewed; exceptional work will be considered for oral presentation. We encourage authors to emphasize the role of learning and its relevance to the application domains at hand. In addition, we hope to identify current successes in the area, and will therefore consider papers that apply previously proposed models to novel domains and data sets.

Submissions should be 4-to-8 pages long, and adhere to NIPS format. Please email your submissions to: nipsgraphs2009 [at] gmail [dot] com

Workshop Format

This is a one-day workshop. The program will feature invited talks, poster sessions, poster spotlights, and a panel discussion. All submissions will be peer-reviewed; exceptional work will be considered for oral presentation. More details about the program will be announced soon.

Organizers

Networks; Enigmas of Chance; Incestuous Amplification

Posted by crshalizi at October 02, 2009 10:24 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems