Clustering
15 Apr 2012 17:31
A topic in data mining and statistics: given a big bunch of data points, assign them to a discrete set of groups in a way which somehow reflects the natural divisions among them, without knowing in advance what the groups are. This is the unsupervised counterpart to classification. (You see where the connection with induction comes in.) This is an important subject, but one of the topics I most dislike teaching in data mining, because the students' natural question is always "how do I know when my clustering algorithm is giving me a good solution?", and it's very hard to give them a reasonable answer. I think this is because most other data-mining problems are basically predictive, and so one can ask how good the prediction is; what's the best way to turn clustering into a prediction problem? (Probabilistic mixture models suggest themselves, of course.)
- Recommended, big picture:
- David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I teach from; also a book I learned a lot from. Comments]
- Isabelle Guyon, Ulrike von Luxburg, and Robert C. Williamson, "Clustering: Science or Art?" [Online PDF]
- Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data [Comments]
- Recommended, close-ups:
- Margaret Ackerman and Shai Ben-David, "Measures of Clustering Quality: A Working Set of Axioms for Clustering", NIPS 2008 [PDF. A rebuttal to Kleinberg's paper on clustering. Thanks to "arthegall" and Ed Vielmetti for pointers.]
- David Arthur and Sergei Vassilvitskii, "k-means++: The Advantages of Careful Seeding" [PDF preprint via Dr. Arthur]
- Sébastien Bubeck, Ulrike von Luxburg, "Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions", Journal of Machine Learning Research 10 (2009): 657--698
- Jon Kleinberg, "An Impossibility Theorem for Clustering", Advances in Neural Information Processing Systems 15 (NIPS 2002) [Thanks to Aaron Clauset for pointing me to this. PDF from Kleinberg and from NIPS. But see Ackerman and Ben-David, above.]
- Modesty forbids me to recommend:
- My lecture notes for my data mining class [However, many of them are based on lecture notes originally written by Tom Minka, and modesty does not forbid me from recommending his work.]
- To read:
- L. Angelini, L. Nitti, M. Pellicoro and Sebastiano Stramaglia, "Cost functions for pairwise data clustering," cond-mat/0103414
- Hendrik Blockeel, Luc De Raedt and Jan Ramon, "Top-down induction of clustering trees," cs.LG/0011032
- Joachim M. Buhmann, "Information theoretic model validation for clustering", arxiv:1006.0375
- Gunnar Carlsson, Facundo Mémoli, "Characterization, Stability and Convergence of Hierarchical Clustering Methods", Journal of Machine Learning Research 11 (2010): 1425--1470
- Ricardo Fraiman, Badih Ghattas, Marcela Svarc, "Clustering using Unsupervised Binary Trees: CUBT", arxiv:1011.2624
- Clara Granell, Sergio Gomez, Alex Arenas, "Mesoscopic analysis of networks: applications to exploratory analysis and data clustering", arxiv:1101.1811
- Madjid Khalilian, Norwati Mustapha, "Data Stream Clustering: Challenges and Issues", arxiv:1006.5261
- Daniel Korenblum and David Shalloway, "Macrostate Data Clustering", Physical Review E 67 (2003): 056704 [This sounds a lot like spectral clustering and diffusion maps]
- Marta Luksza, Michael Lassig and Johannes Berg, "Significance Analysis and Statistical Mechanics: An Application to Clustering", Physical Review Letters 105 (2010): 220601
- Marina Meila, Machine Learning 86 (2012): 369--389
- Vladimir Nikulin and Geoffrey J. McLachlan, "Strong Consistency of Prototype Based Clustering in Probabilistic Space", arxiv:1004.3101 [The regularization they're using isn't at all obvious on first scan of the paper]
- Christoph Pamminger and Sylvia Fruühwirth-Schnatter, "Model-based Clustering of Categorical Time Series", Bayesian Analysis 5 (2010): 345--368
- Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman, "Stability of Density-Based Clustering", arxiv:1011.2771
- Alessandro Rinaldo and Larry Wasserman, "Generalized Density Clustering", Annals of Statistics 38 (2010): 2678--2722, arxiv:0907.3454
- Oha Shamir and Naftali Tishby, "Stability and model selection in k-means clustering", Machine Learning 80 (2010): 213--243
- Qing Song, "A Robust Information Clustering Algorithm", Neural Computation 17 (2005): 2672--2698 ["We focus on the scenario of robust information clustering (RIC) based on the minimax optimization of mutual information (MI). The minimization of MI leads to the standard mass-constrained deterministic annealing clustering, which is an empirical risk-minimization algorithm. The maximization of MI works out an upper bound of the empirical risk via the identification of outliers (noisy data points). Furthermore, we estimate the real risk VC-bound and determine an optimal cluster number of the RIC based on the structural risk-minimization principle. One of the main advantages of the minimax optimization of MI is that it is a nonparametric approach, which identifies the outliers through the robust density estimate and forms a simple data clustering algorithm based on the square error of the Euclidean distance."]
- Susanne Still and William Bialek, "How many clusters? An information theoretic perspective," physics/0303011 = Neural Computation 16 (2004): 2483--2506
- Wei Sun, Junhui Wang, and Yixin Fang, "Regularized k-means clustering of high-dimensional data and its asymptotic consistency", Electronic Journal of Statistics 6 (2012): 148--167
- Nguyen Xuan Vinh, Julien Epps, James Bailey, "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance", Journal of Machine Learning Research 11 (2010): 2837--2854
- Ulrike von Luxburg, "A Tutorial on Spectral Clustering", arxiv::0711.0189
- Junhui Wang, "Consistent selection of the number of clusters via crossvalidation", Biometrika 97 (2010): 893--904
