Data Mining
07 Jan 2008 21:35
I've taught a course on this, so I ought to be able to describe it, oughtn't I? Data mining, more stuffily "knowledge discovery in databases", is the art of finding and extracting useful patterns in very large collections of data. It's not quite the same as machine learning, because, while it certainly uses ML techniques, the aim is to directly guide action (praxis!), rather than to develop a technology and theory of induction. In some ways, in fact, it's closer to what statistics calls "exploratory data analysis", though with certain advantages and limitations that come from having really big data to explore.
- Recommended:
- Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16 (2001): 199--231 [very much including the discussion by others and the reply by Breiman]
- David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I assigned; also a book I learned a lot from]
- Bernard E. Harcourt, Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age [Blurb. Precis as a 43 pp. PDF working paper]
- Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
- Kling, Scherson and Allen, "Parallel Computing and Information Capitalism," in Metropolis and Rota (eds.), A New Era in Computation (1992) [A batch of UC Irvine comp. sci. professors who write like sociologists. "'Information capitalism' refers to forms of organization in which data-intensive techniques and computerization are key strategic resources for corporate production."]
- Erik Larson, The Naked Consumer
- Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide [Pedestrian, but it is practical, and adapted to the meanest, i.e. the managerial, understanding]
- To read:
- Ian Ayres, Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart [Despite the painful title, Ayres has done a lot of interesting work on social statistics]
- David L. Banks and Yasmin H. Said, "Data Mining in Electronic Commerce", math.ST/0609204 = Statistical Science 21 (2006): 234--246
- Burnham, Rise of the Computer State
- Pavel Dmitriev and Carl Lagoze, "Mining Generalized Graph Patterns based on User Examples", cs.DS/0609153
- Usama Fayyad, Georges G. Grinstein and Andreas Wierse (eds.), Information Visualization in Data Mining and Knowledge Discovery
- Ronen Feldman and James Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data [Blurb]
- Hillol Kargupta and Philip Chan (eds.), Advances in Distributed and Parallel Knowledge Discovery [Blurb]
- Hillol Kargupta, Anupam Joshi, Krishnamoorthy Sivakumar and Yelena Yesha, Data Mining: Next Generation Challenges and Future Directions [Blurb]
- Nicholas M. Kiefer and C. Erik Larson, "Specification and Informational Issues in Credit Scoring" [SSRN]
- Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data [Blurb]
- Colleen McCue, Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis
- Michalski, Bratko and Kubat (eds.), Machine Learning and Data Mining: Methods and Applications
- Naren Ramakrishnan and Chris Bailey-Kellogg, "Sampling Strategies for Mining in Data-Scarce Domains," cs.CE/0204047
- Juan J. Samper, Pedro A. Castillo, Lourdes Araujo, and J. J. Merelo, "NectaRSS, an RSS feed ranking system that implicitly learns user preferences", cs.IR/0610019
- Daniel J. Solove, "Data Mining and the Security-Liberty Debate" [SSRN/990030]
- Andreas L. Symeonidis and Pericles A. Mitkas, Agent Intelligence through Data Mining [Blurb]
- Joseph Turow, Niche Envy: Marketing Discrimination in the Digital Age [Blurb]
- Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
- Johannes Wollbold, "Attribute Exploration of Discrete Temporal Transitions", q-bio/0701009
- Mohammed Javeed Zaki
- Scalable Data Mining for Rules [Ph.D. thesis, U. of Rochester, 1998; on-line through NCSTRL]
- "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning 42 (2001): 31--60
