Machine Learning, Statistical Inference and Induction
24 Apr 2008 08:19
There's a place where AI, statistics and epistemology-methodology converge, or want to anyhow. "Machine learning" is the AI label: how do we make a machine that can find and learn the regularities in a data set? (If the data set is really, really big, and we care mostly about making practically valuable predictions, this becomes data mining, or "knowledge discovery in databases," KDD.) The statisticians ask very similar questions about model-fitting and hypothesis-testing. The epistemologists are mired in the problem of induction, and "inference to the best explanation". (Who coined that last phrase?) The fields over-lap in the most crazy-quilt and arbitrary way: I've heard university librarians arguing over whether specific books should go to the engineering or the philosophy library, for instance.
The connection to neuroscience and cognitive science is plain: how on Earth do human beings, and other critters, actually learn? Given that there are many different strategies, which ones do organisms use, and why, and are they good ones? (It's entirely possible that we've gotten locked in to inefficient learning strategies; then the question becomes whether or not they can be improved.) Studying learning by organisms lets us test theories of learning-in-the-abstract, and vice versa: if we had, say, a good proof that a certain learning scheme simply would not work, we'd know that animals don't use it.
One fairly strong result seems to be that tabulae rasae don't work: you've got to give the machine/baby/scientist some hints, or restrict the field of possible hypotheses initially, or you'll never get anywhere. This was at least implicit in Hume, and I believe the other classical empiricists as well, but they don't seem to have been restrictive enough to account for the way we actually do learn. Natural selection is the obvious candidate for having restricted our hypothesis-set, and for having designed our learning mechanisms.
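A toy sketch of why the blank slate gets nowhere, under assumptions that are mine rather than anything from the references below: inputs are 3-bit tuples, the "true" rule (the hypothetical `target`) copies the first bit, and the learner sees five of the eight possible cases. The unrestricted learner entertains every Boolean function on three bits; the restricted one entertains only "copy bit i". All names here are illustrative.

```python
from itertools import product

# All 8 possible inputs: 3-bit tuples, in lexicographic order.
INPUTS = list(product([0, 1], repeat=3))

def target(x):
    """Hypothetical 'true' regularity: the label copies the first bit."""
    return x[0]

TRAIN = INPUTS[:5]   # five observed cases
TEST = INPUTS[5:]    # three cases never seen in training

def blank_slate_votes():
    """Unrestricted learner: every Boolean function on 3 bits is a hypothesis."""
    tables = (dict(zip(INPUTS, labels))
              for labels in product([0, 1], repeat=len(INPUTS)))
    # Keep only the lookup tables that agree with all the training data.
    survivors = [t for t in tables if all(t[x] == target(x) for x in TRAIN)]
    # For each unseen input, count how many survivors predict label 1.
    votes = {x: sum(t[x] for t in survivors) for x in TEST}
    return len(survivors), votes

def biased_survivors():
    """Restricted learner: only 'copy bit i' hypotheses, for i in 0..2."""
    return [i for i in range(3) if all(x[i] == target(x) for x in TRAIN)]

if __name__ == "__main__":
    n, votes = blank_slate_votes()
    print(n, "unrestricted hypotheses fit the data; votes for label 1:", votes)
    print("restricted hypotheses that fit the data: copy bit", biased_survivors())
```

In this setup the unrestricted hypotheses that survive the data split evenly on every unseen input, so the data tell that learner nothing about new cases, while the restricted class is whittled down to a single hypothesis that happens to predict all of them. Whether a given restriction is the right one is, of course, a separate question.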
My positivist temperament can hardly help being pleased by this "attempt to introduce the experimental method of reasoning into moral subjects," which, as data mining, has massive industrial applications. My real interest in this isn't, for once, philosophical. Instead, I want to be able to quantify, or at the very least characterize, self-organization, which means I need a good way of automatically finding patterns or regularities in data-sets. For someone who's got the computational mechanics gospel, this means "inferring statistical complexity," and that means the automated construction of abstract-machine or formal-language models of data-sets. (Alternately: Figuring out how natural things compute.) And doing that well means addressing all the issues people in these areas address, so I figure I ought to just steal from them.
Causality; collective cognition; ensemble methods; grammatical inference; graphical models; learning in games; learning theory; the minimum description length principle; model selection; neural nets; sequential decision-making; time series; and universal prediction algorithms now get their own notebooks; other topics also need to be spun off from this one.
- Recommended, big picture:
- Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16 (2001): 199--231 [Very much including the discussion by others and the reply by Breiman. Thanks to Chris Wiggins for alerting me to this.]
- Ulf Grenander, Elements of Pattern Theory
- David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- John H. Holland, Keith J. Holyoak, Richard E. Nisbett, and Paul R. Thagard, Induction: Processes of Inference, Learning and Discovery [Review: The Best-Laid Schemes o' Mice an' Men]
- Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory [Review: How to Build a Better Guesser]
- Deborah G. Mayo, Error and the Growth of Experimental Knowledge [How to use standard statistical tests to learn from experiment, without Bayesian priors or other a priori folderol. Review: We Have Ways of Making You Talk, or, Long Live Peircism-Popperism-Neyman-Pearson Thought!]
- Deborah G. Mayo and D. R. Cox, "Frequentist statistics as a theory of inductive inference", math.ST/0610846
- Jorma Rissanen, Stochastic Complexity in Statistical Inquiry [Review: Less Is More, or, Ecce data!]
- Sara J. Shettleworth, Cognition, Evolution and Behavior
- Peter Spirtes, Clark Glymour and Richard Scheines, Causation, Prediction, and Search
- Chris Thornton, Truth from Trash: How Learning Makes Sense [Well, half a recommendation. Review: Two Cheers for Trash]
- V. N. (=Vladimir Naumovich) Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
- H. Peyton Young, Individual Strategy and Social Structure [Pretty dumb agents nonetheless able to learn in a basic sense, and what they can accomplish in the way of societies. Review: A Myopic (and Sometimes Blind) Eye on the Main Chance, or, the Origins of Custom]
- Recommended, close-ups:
- Shun-ichi Amari, "Information Geometry on Hierarchical Decomposition of Stochastic Interactions," IEEE Transactions on Information Theory 47 (2001): 1701--1711 [A way of finding "parts" in complex distributions; uses many differential geometry tricks to do statistics. PDF reprint]
- Massimiliano Badino, "An Application of Information Theory to the Problem of the Scientific Experiment", Synthese 140 (2004): 355--389 [MS Word preprint. See comments under Information Theory.]
- Jonathan Baxter, "A Model of Inductive Bias Learning," Journal of Artificial Intelligence Research 12 (2000): 149--198 [How to learn what class of hypotheses you should be trying to use, i.e., your inductive bias. Assumes independence, again.]
- William Bialek, Ilya Nemenman, and Naftali Tishby, "Predictability, Complexity and Learning," physics/0007070
- Ken Binmore, Making Decisions in Large Worlds ["This paper argues that we need to look beyond Bayesian decision theory for an answer to the general problem of making rational decisions under uncertainty." PDF manuscript; thanks to Nicolas Della Penna for the pointer]
- Margaret Boden, The Creative Mind: Myths and Mechanisms [How and when to change the kind of representation you're using, a topic shamefully neglected in the literature. Precis]
- Josh Bongard and Hod Lipson, "Automated reverse engineering of nonlinear dynamical systems", Proceedings of the National Academy of Sciences (USA) 104 (2007): 9943--9948 [Thanks to Chris Weed for pointing me to this. Interesting, but basically unaware of the literature on state-space reconstruction in nonlinear dynamics.]
- R. B. Braithwaite, Scientific Explanation
- Pedro Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery, 3 (1999) [Online]
- Marco Dorigo and Marco Colombetti, Robot Shaping: An Experiment in Behavior Engineering [Review: Crawling Towards the Light]
- John W. Fisher III, Alexander T. Ihler and Paul A. Viola, "Learning Informative Statistics: A Nonparametric Approach", pp. 900--906 in NIPS 12 (1999) [PDF reprint. I'd call this more of a semi-parametric approach than a fully non-parametric one; they assume a parametric form for the dependence structure, but are agnostic about the distributions of innovations, and so try to maximize non-parametrically estimated mutual informations.]
- David J. Hand, "Classifier Technology and the Illusion of Progress", Statistical Science 21 (2006): 1--15 = math.ST/0606441 [Or: don't believe everything you read in ICML! With commentary, available from the arxiv.org link]
- Hinton and Sejnowski (eds.), Unsupervised Learning [A sort of "Neural Computation's Greatest Hits" compilation]
- Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
- Kevin T. Kelly, "A New Solution to the Puzzle of Simplicity", phil-sci/2984 [This is a summary of Kelly's recent work on Occam's Razor, which is, so far as I know, the only justification for it which doesn't either massively beg the question, or make massive assumptions about the nature of the world, Divine Providence, etc.]
- Shane Legg, "Is There an Elegant Universal Theory of Prediction?", cs.AI/0606070 [A nice set of diagonalization arguments against the hope of a universal prediction scheme which has the nice features of Solomonoff-style induction, but is actually computable.]
- Jerzy Neyman, First Course in Probability and Statistics [Fine explanation of his ideas about "rules of inductive behavior" --- which probably isn't very good methodology, but has the makings of excellent robotics]
- Leonid Peshkin, "Structure induction by lossless graph compression", cs.DS/0703132 [Adapting data-compression ideas to discover hierarchical structures in graphs, e.g., the 4 bases from a tinker-toy model of DNA.]
- Gerhard Schurz, "Universal vs. Local Prediction Strategies: A Game-Theoretical Approach to the Problem of Induction", phil-sci/3720 [Slides only?!?]
- Spyros Skouras, "Decisionmetrics: Towards a Decision-Based Approach to Econometrics" [Suppose what you really want to do with your model is to make decisions, e.g., to buy and sell and make money doing so. Then fitting the model to minimize a standard error measure, e.g., mean square error, often gives worse performance than fitting the model to minimize expected losses. This applies much more broadly than Spyros's financial examples may suggest.]
- Aris Spanos, "The Curve-Fitting Problem, Akaike-type Model Selection, and the Error Statistical Approach" [PDF preprint]
- Sara van de Geer, Applications of Empirical Process Theory [A.k.a. Empirical Process Theory in M-Estimation]
- Blaz Zupan, Marko Bohanec, Janez Demsar and Ivan Bratko, "Learning by discovering concept hierarchies", Artificial Intelligence 109 (1999): 211--242 [Thanks to Aleks Jakulin for letting me know about this. PDF preprint]
- Not exactly recommended:
- Dana Ballard, An Introduction to Natural Computation [Review: Not Natural Enough]
- Gilbert Harman and Sanjeev Kulkarni, Reliable Reasoning: Induction and Statistical Learning Theory [Published by MIT Press; 2006 draft free online via Prof. Kulkarni (about 100 pages). The technical material on learning theory is alright, so far as it goes, but the philosophy is irritatingly lack-luster. Definitely not worth paying what the publisher charges for it.]
- Modesty forbids me to recommend:
- CRS, Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata [Ph.D. thesis, UW-Madison, 2001]
- CRS and Kristina Lisa Klinkner, "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences", pp. 504--511 in UAI 2004, cs.LG/0406011
- To read:
- Tatsuya Akutsu, Satoru Miyano and Satoru Kuhara, "A simple greedy algorithm for finding functional relations: efficient implementation and average case analysis," Theoretical Computer Science 292 (2002): 481--495
- Atocha Aliseda, Abductive Reasoning: Logical Investigations into Discovery and Explanation [Blurb]
- Luis B. Almeida, "MISEP - Linear and Nonlinear ICA Based on Mutual Information," Journal of Machine Learning Research submitted [online]
- Ethem Alpaydin, Introduction to Machine Learning [Blurb, author's book-site]
- Andris Ambainis, "Probabilistic inductive inference: a survey", cs.LG/9902026 [Taking "inductive inference" exclusively in the sense of learning recursive functions]
- L. Angelini, L. Nitti, M. Pellicoro and Sebastiano Stramaglia, "Cost functions for pairwise data clustering," cond-mat/0103414
- Nihat Ay
- "Locality of global stochastic interaction in directed acyclic networks," preprint, MPI-MIS 54/2001
- "An information geometric approach to a theory of pragmatic structuring," MPI-MIS 52/2000
- Olivier Aycard, Jean-Francois Mari and Richard Washington, "Learning to automatically detect features for mobile robots using second-order Hidden Markov Models", cs.AI/0501068
- Chris Bailey-Kellogg and Naren Ramakrishnan, "Qualitative Analysis of Correspondence for Experimental Algorithmics," cs.AI/0204053
- Vijay Balasubramanian, "Statistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability Distributions", Neural Computation 9 (1997): 349--368
- Pierre Baldi et al., Modeling the Internet and the Web: Probabilistic Methods and Algorithms
- Jayanta Basak, "Online Adaptive Decision Trees", Neural Computation 16 (2004): 1959--1981
- William Bechtel and Robert C. Richardson, Discovering Complexity: Decomposition and Localization as Strategies in Scientific Research [Blurb]
- Sergey V. Beiden, Marcus A. Maloof and Robert F. Wagner, "A General Model for Finite-Sample Effects in Training and Testing of Competing Classifiers", IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003): 1561--1569
- D. Paul Benjamin (ed.), Change of Representation and Inductive Bias
- Francesco Bergadano and Daniele Gunetti, Inductive Logic Programming: From Machine Learning to Software Engineering
- S. A. Billings and H.-L. Wei, "A New Class of Wavelet Networks for Nonlinear Systems Identification", IEEE Transactions on Neural Networks 16 (2005): 862--874
- James Blachowicz, Of Two Minds: The Nature of Inquiry [From the back cover: "The logic of correction developed here directly opposes the claim made by evolutionary epistemologists such as Popper and Campbell that there is no such thing as a 'logical method for having new ideas.' ... This comprehensive and revolutionary theory challenges traditional epistemology's conception of justification and provides substantial new interpretations of the nature of ampliative inference, representation and meaning, Platonic and Hegelian dialectic, Kantian analysis, the heuristic function of models and metaphors, and the role of inquiry in the constitution of human consciousness." All this in only four hundred pages! But the stuff on a logic of correction is very important --- if correct.]
- Gilles Blanchard and Donald Geman, "Hierarchical testing designs for pattern recognition", math.ST/0507421 = Annals of Statistics 33 (2005): 1155--1202
- Hendrik Blockeel, Luc De Raedt and Jan Ramon, "Top-down induction of clustering trees," cs.LG/0011032
- Hendrik Blockeel and Jan Struyf, "Efficient algorithms for decision tree cross-validation," cs.LG/0110036
- Avrim Blum, Adam Kalai and Hal Wasserman, "Noise-Tolerant Learning, the Parity Problem, and the Statistical Query Model," cs.LG/0010022
- Leo Breiman, "Prediction Games and Arcing Algorithms," Neural Computation 11 (1999): 1493--1517
- Robert Alan Brown, Machines that Learn: Based on the Principle of Empirical Control
- Meir Buzaglo, The Logic of Concept Expansion [blurb]
- Adam Cannon, J. Mark Ettinger, Don Hush, and Clint Scovel, "Machine Learning with Data Dependent Hypothesis Classes," JMLR 2 (2002): 335--358
- Philip Ellery Catton, "The Justification(s) of Induction(s)," online
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games
- Tommy W. S. Chow and D. Huang, "Estimating Optimal Feature Subsets Using Efficient Estimation of High-Dimensional Mutual Information", IEEE Transactions on Neural Networks 16 (2005): 213--224
- Andy Clark and Chris Thornton, "Trading Spaces: Computation, Representation and the Limits of Uninformed Learning," Behavioral and Brain Sciences 20 (1997): 57--90 [Draft]
- Marco Cuturi and Kenji Fukumizu, "Multiresolution Kernels", cs.LG/0507033
- Peter Dayan, "Recurrent Sampling Models for the Helmholtz Machine," Neural Computation 11 (1999): 653--677
- Carlos R. de la Mora B., Carlos Gershenson and Angelica Garcia-Vega, "The role of behavior modifiers in representation development", cs.AI/0403006
- Luc Devroye et al., A Probabilistic Theory of Pattern Recognition
- Thomas G. Dietterich, "Machine Learning for Sequential Data" [PDF. Thanks to Gustavo Lacerda for a pointer.]
- Yannis Dimopoulos and Antonis Kakas, "Information Integration and Computational Logic," cs.AI/0106025
- Pedro Domingos [All from his web-site]
- A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering
- Mining High-Speed Data Streams
- Mining Time-Changing Data Streams
- Dowe, Korb and Oliver (eds.), Information, Statistics and Induction in Science
- Daniel Egloff, "Monte Carlo Algorithms for Optimal Stopping and Statistical Learning", math.PR/0408276
- Yair Even-Zohar and Dan Roth, "A Sequential Model for Multi-Class Classification," cs.AI/0106044
- Deniz Erdogmus, Kenneth E. Hild, II, Yadunandana N. Rao and José C. Príncipe, "Minimax Mutual Information Approach for Independent Component Analysis", Neural Computation 16 (2004): 1235--1252
- Oleg V. Favorov and Dan Ryder, "SINBAD: A neocortical mechanism for discovering environmental variables and regularities hidden in sensory input", Biological Cybernetics 90 (2004): 191--202
- Aidan Feeney and Evan Heit (eds.), Inductive Reasoning: Cognitive, Mathematical, and Neuroscientific Approaches [blurb]
- Jacob Feldman, "How surprising is a simple pattern? Quantifying 'Eureka!'," Cognition 93 (2004): 199--224 [Claims to (a) have a psychologically valid measure of subjective complexity, and (b) derive a null distribution for it!]
- David Finton, "When Do Differences Matter? On-Line Feature Extraction Through Cognitive Economy", cs.LG/0404032 = Cognitive Systems Research 6 (2005): 263--281
- Gary William Flake, "The Calculus of Jacobian Adaptation"
- Francois Fleuret and Eric Brunet, "DEA: An Architecture for Goal Planning and Classification," Neural Computation 12 (2000): 1987--2008
- Flocchini et al. (eds.), Structure, Information and Communication Complexity
- Malcolm R. Forster, "How do Simple Rules 'Fit to Reality' in a Complex World?", Minds and Machines 9 (1999): 543--564 [A take on the Gigerenzer et al. idea of fast and frugal heuristics, especially their ecological adaptation to the environment. "The main purpose of this article is to apply these ideas to learning rules --- methods for constructing, selecting or evaluating competing hypotheses in science, and to the methodology of machine learning... The bad news is that ecological validity is particularly difficult to implement and difficult to understand. The good news is that it builds an important bridge from normative psychology and machine learning to recent work in the philosophy of science, which considers predictive accuracy to be a primary goal of science."]
- Paul Franceschi, "A Solution to Goodman's Paradox," Dialogue 40 (2001) [online]
- Floris Geerts, Bart Goethals and Jan Van den Bussche, "A Tight Upper Bound on the Number of Candidate Patterns," cs.DB/0112007
- S. Gey and E. Nedelec, "Model Selection for CART Regression Trees", IEEE Transactions on Information Theory 51 (2005): 658--670
- Vinod Goel and Raymond J. Dolan, "Differential involvement of left prefrontal cortex in inductive and deductive reasoning", Cognition 93 (2004): B109--B121
- John C. Gower and Jörg Blasius, "Multivariate Prediction with Nonlinear Principal Components Analysis"
- "Theory", Quality and Quantity 39 (2005): 359--372
- "Application", Quality and Quantity 39 (2005): 373--390
- Ulf Grenander, Abstract Inference
- Laszlo Gyorfi et al., A Distribution-Free Theory of Nonparametric Regression
- Stephen José Hanson et al., eds., Computational Learning Theory and Natural Learning Systems
- I: Constraints and Prospects
- II: Interactions between Theory and Experiment
- Petr Hajek and Martin Holena, "Formal logics of discovery and hypothesis formation by machine," Theoretical Computer Science 292 (2002): 345--357
- Peter Hall and Qiwei Yao, "Approximating conditional distribution functions using dimension reduction", math.ST/0507432 = Annals of Statistics 33 (2005): 1404--1421
- Patrick Heas and Mihai Datcu, "Supervised learning on graphs of spatio-temporal similarity in satellite image sequences", 0709.3013
- Jaakko Hintikka, Socratic Epistemology: Explorations of Knowledge-Seeking by Questioning [blurb]
- Tin Kam Ho, "A Numerical Example on the Principles of Stochastic Discrimination", cs.CV/0402021
- Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka and Hannu Toivonen, "TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies," The Computer Journal 42 (1999): 100--111
- Christian Igel and Marc Toussaint, "On Classes of Functions for which No Free Lunch Results Hold," cs.NE/0108011
- John R. Josephson and Susan G. Josephson (eds.), Abductive Inference: Computation, Philosophy, Technology [blurb]
- L. P. Kaelbling, Learning in Embedded Systems
- Yuri Kalnishkan, Vladimir Vovk and Michael V. Vyugin, "How many strings are easy to predict?", Information and Computation 201 (2005): 55--71 ["It is well known in the theory of Kolmogorov complexity that most strings cannot be compressed; more precisely, only exponentially few (O(2^(n-m))) binary strings of length n can be compressed by m bits. This paper extends the 'incompressibility' property of Kolmogorov complexity to the 'unpredictability' property of predictive complexity. The 'unpredictability' property states that predictive complexity (defined as the loss suffered by a universal prediction algorithm working infinitely long) of most strings is close to a trivial upper bound (the loss suffered by a trivial minimax constant prediction strategy). We show that only exponentially few strings can be successfully predicted and find the base of the exponent."]
- Michael Kearns and Dana Ron, "Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation," Neural Computation 11 (1999): 1427--1453
- Kevin T. Kelly
- The Logic of Reliable Inquiry [Includes cartoons by the author]
- "How Simplicity Helps You Find the Truth without Pointing at It"
- "Simplicity, Truth, and the Unending Game of Science" [PDF preprint]
- David Klahr (with Kevin Dunbar, Anne L. Fay, David Penner and Christian D. Schunn), Exploring Science: The Cognition and Development of Discovery Processes
- Eric D. Kolaczyk and Robert D. Nowak, "Multiscale likelihood analysis and complexity penalized estimation", math.ST/0406424 = Annals of Statistics 32 (2004): 500--527
- Daniel Korenblum and David Shalloway, "Macrostate data clustering", Physical Review E 67 (2003): 056704
- Barbara Koslowski, Theory and Evidence: The Development of Scientific Reasoning
- Ingo Kreuz and Dieter Roller, "Relevant Knowledge First: Reinforcement Learning and Forgetting in Knowledge Based Configuration," cs.AI/0109034
- Yoshimitsu Kudoh, Makoto Haraguchi and Yoshiaki Okubo, "Data abstractions for decision tree induction," Theoretical Computer Science 292 (2002): 387--416
- Henry E. Kyburg Jr. and Choh Man Teng, "Evaluating Defaults," cs.AI/0207083
- Steffen Lange and Gunter Grieser, "Variants of iterative learning," Theoretical Computer Science 292 (2002): 359--376
- Ming Li, John Tromp and Paul Vitanyi, "Sharpening Occam's Razor," cs/0201005
- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory 50 (2004): 2708--2726
- Stephen Luttrell, "Using Self-Organising Mappings to Learn the Structure of Data Manifolds", cs.NE/0406017
- David J. C. MacKay, Information Theory, Inference and Learning Algorithms [Online version]
- Heikki Mannila and Kari-Jouko Räihä, "On the complexity of inferring functional dependencies," Discrete Applied Mathematics 40 (1992): 237--243
- Martin and Osherson, Elements of Scientific Inquiry [A good introduction to the theory of formal learning, especially of recursive functions in the absence of noise. Not even hand-waving that this is a sensible idealization of what scientists do.]
- Zvika Marx, Ido Dagan and Joachim Buhmann, "Coupled Clustering: a Method for Detecting Structural Correspondence," cs.LG/0107032
- Geoffrey J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition
- Geoffrey J. McLachlan and David Peel, Finite Mixture Models
- Abraham Meidan and Boris Levin, "Choosing from Competing Theories in Computerised Learning", Minds and Machines 12 (2002): 119--129
- I. J. Myung, Vijay Balasubramanian and M. A. Pitt, "Counting probability distributions: Differential geometry and model selection", Proceedings of the National Academy of Sciences (USA) 97 (2000): 11170--11175
- National Research Council, Massive Data Sets [Online]
- Juan Pablo Neirotti and Nestor Caticha, "Dynamics of the evolution of learning algorithms by selection", Physical Review E 67 (2003): 041912
- O. Nelles, Nonlinear System Identification
- Ilya Nemenman, "Fluctuation-Dissipation Theorem and Models of Learning", Neural Computation 17 (2005): 2006--2033 ["We analyze how various abstract Bayesian learners perform on different data and argue that it is difficult to determine which learning-theoretic computation is performed by a particular organism using just its performance in learning a stationary target (learning curve). Based on the fluctuation-dissipation relation in statistical physics, we then discuss a different experimental setup that might be able to solve the problem."]
- Randall C. O'Reilly, "Generalization in Interactive Networks: The Benefits of Inhibitory Competition and Hebbian Learning," Neural Computation 13 (2001): 1199--1241
- Liam Paninski, "Asymptotic Theory of Information-Theoretic Experimental Design", Neural Computation 17 (2005): 1480--1507
- Hanchuan Peng, Fuhui Long and Chris Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005): 1226--1238 [This sounds like an idea I had in 2002, and was too dumb/lazy to follow up on.]
- Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau and Leslie Pack Kaelbling, "Learning to Cooperate via Policy Search," cs.LG/0105032
- Leonid Peshkin and Christian R. Shelton, "Learning from Scarce Experience," cs.AI/0204043
- Karl Pfleger
- On-Line Learning of Undirected Sparse n-grams
- Learning Predictive Compositional Hierarchies [PS.gz]
- J. Peltonen and S. Kaski, "Discriminative Components of Data", IEEE Transactions on Neural Networks 16 (2005): 68--83
- Fenna H. Poletiek, Hypothesis Testing Behaviour [Review by Denny Borsboom]
- Joel B. Predd, Sanjeev R. Kulkarni and H. Vincent Poor
- "Consistency in Models for Distributed Learning under Communication Constraints", cs.IT/0503071
- "Distributed Learning in Wireless Sensor Networks", cs.IT/0503072
- Detlef Prescher, "A Tutorial on the Expectation-Maximization Algorithm Including Maximum-Likelihood Estimation and EM Training of Probabilistic Context-Free Grammars", cs.CL/0412015
- Foster Provost and Tom Fawcett, "Robust Classification for Imprecise Environments," cs.LG/0009007
- Vasin Punyakanok and Dan Roth, "The Use of Classifiers in Sequential Inference," cs.LG/0111003
- Dmitri A. Rachkovskij and Ernst M. Kussul, "Building large-scale hierarchical models of the world with binary sparse distributed representations", online
- Maxim Raginsky, "A complexity-regularized quantization approach to nonlinear dimensionality reduction", cs.IT/0501091
- Magnus Rattray, "Stochastic trapping in a solvable model of on-line independent component analysis," cond-mat/0105057
- G. Reents and R. Urbanczik, "Self-Averaging and On-Line Learning," cond-mat/9805339
- Dan Roth, "Learning in Natural Language: Theory and Algorithmic Approaches" [online]
- Hichem Sahbi and Donald Geman, "A Hierarchy of Support Vector Machines for Pattern Detection", Journal of Machine Learning Research 7 (2006): 2087--2123
- Erik Sandewall, Features and Fluents: The Representation of Knowledge about Dynamical systems
- Gerhard Schurz
- "Meta-Induction and the Prediction Game: A New View On Hume's Problem" [PDF preprint]
- "Patterns of Abduction" [PDF preprint]
- Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization," cs.IR/0110053
- Aris Spanos
- "Statistical Induction, Severe Testing, and Model Validation" [Preprint]
- "Revisiting data mining: `hunting' with or without a license", Journal of Economic Methodology 7 (2000): 231--264 [PDF reprint]
- Peter Sollich and Anason Halees, "Learning curves for Gaussian process regression: Approximations and bounds," cond-mat/0105015
- Ray Solomonoff's Papers
- Qing Song, "A Robust Information Clustering Algorithm", Neural Computation 17 (2005): 2672--2698 ["We focus on the scenario of robust information clustering (RIC) based on the minimax optimization of mutual information (MI). The minimization of MI leads to the standard mass-constrained deterministic annealing clustering, which is an empirical risk-minimization algorithm. The maximization of MI works out an upper bound of the empirical risk via the identification of outliers (noisy data points). Furthermore, we estimate the real risk VC-bound and determine an optimal cluster number of the RIC based on the structural risk-minimization principle. One of the main advantages of the minimax optimization of MI is that it is a nonparametric approach, which identifies the outliers through the robust density estimate and forms a simple data clustering algorithm based on the square error of the Euclidean distance."]
- Eduardo D Sontag, "Adaptation Implies Internal Model," math.OC/0203228
- Susanne Still and William Bialek, "How many clusters? An information theoretic perspective," physics/0303011 = Neural Computation 16 (2004): 2483--2506
- Ron Sun and C. L. Giles (eds.), Sequence Learning: Paradigms, Algorithms, and Applications
- Eiji Takimoto and Akira Maruoka, "Top-down decision tree learning as information based boosting," Theoretical Computer Science 292 (2002): 447--464
- Sebastian Thrun and Lorien Pratt (eds.), Learning to Learn
- Robert Tibshirani and Larry Wasserman, "Correlation-sharing for detection of differential gene expression", math.ST/0608061 ["Our proposal averages the univariate scores of each feature with the scores in correlation neighborhoods. ... The general idea of correlation-sharing can be applied to other prediction problems involving a large number of correlated features."]
- Marc Toussaint, "Self-adaptive exploration in evolutionary search," physics/0102009
- Richard Turner and Maneesh Sahani, "A Maximum-Likelihood Interpretation for Slow Feature Analysis", Neural Computation 19 (2007): 1022--1038
- Peter D. Turney, "How to shift bias: Lessons from the Baldwin effect," Evolutionary Computation 4 (1996): 271--295 [online]
- D. Volk and M. G. Stepanov, "Resampling methods for document clustering," cond-mat/0109006
- Volodya Vovk, "Competitive On-Line Statistics" [PDF, via citeseer]
- Grace Wahba [All online from Prof. Wahba's website]
- GW, "An introduction to (smoothing spline) ANOVA models in RKHS with examples in geographical data, medicine, atmospheric science and machine learning", math.ST/0410419
- GW, Xiwu Lin, Fangyu Gao, Dong Xiang, Ronald Klein and Barbara Klein, "The Bias-Variance Tradeoff and the Randomized GACV," Tech. Rep. 997 (1998)
- GW, Yi Lin and Hao Zhang, "Generalized Approximate Cross Validation for Support Vector Machines, or, Another Way to Look at Margin-Like Quantities", Tech. Rep. 1006 (1999)
- GW, "Generalization and Regularization in Nonlinear Learning Systems," Tech. Rep. 1015 (2000)
- Yi Lin, Yoonkyung Lee and GW, "Support Vector Machines for Classification in Nonstandard Situations," Tech. Rep. 1016 (2000) [= Machine Learning 46 (2002): 191--202]
- W. Wang, P. Jones and D. Partridge, "A Comparative Study of Feature-Salience Ranking Techniques," Neural Computation 13 (2001): 1603--1623
- Satoshi Watanabe, Knowing and Guessing: A Quantitative Study of Inference and Information
- M. Brandon Westover and Joseph A. O'Sullivan, "Achievable Rates for Pattern Recognition", cs.IT/0509022 ["In this paper we describe a general mathematical model for pattern recognition systems subject to resource constraints, and show [that the] resource-complexity tradeoff can be characterized in terms of three rates related to number of bits available for representing memory and sensory data, and the number of patterns populating a given statistical environment. We prove single-letter information theoretic bounds governing the achievable rates, and illustrate the theory by analyzing the elementary cases where the pattern data is either binary or Gaussian."]
- K. Y. Michael Wong, S. Li and Peixun Luo, "Mean-Field Theory of Learning: From Dynamics to Statics," cond-mat/0006251
- Ying Yang, Xindong Wu and Xingquan Zhu, "Mining in Anticipation for Concept Change: Proactive-Reactive Prediction in Data Streams", Data Mining and Knowledge Discovery 13 (2006): 261--289
- H. Zha, X. He, C. Ding, M. Gu and H. Simon, "Bipartite Graph Partitioning and Data Clustering," cs.IR/0108018
- Baibo Zhang, Changshui Zhang and Xing Yi, "Competitive EM algorithm for finite mixture models", Pattern Recognition 37 (2004): 131--144
- To write:
- CRS, Causal Architecture and Model Discovery: Theory, Algorithms and Examples
- CRS, "Three Kinds of Complexity in Prediction: Induction, Estimation and Calculation"
