Learning Theory: An Approximation Theory Viewpoint
by Felipe Cucker and Ding Xuan Zhou
Cambridge Monographs on Applied and Computational Mathematics
Cambridge University Press; May 2007
Reviewed by: David Suter (Australia)
I was looking forward to reviewing this book because I have long held the view that the “right way” to look at machine learning is the approximation theoretic one. Thus the title was certainly appealing. However, what I was looking for was a book that would convince the community at large that such was the way they should be thinking of machine learning. This is not that sort of book, unfortunately. However, for specialists, it will still be a useful addition to their bookshelves.
This is a book for mathematicians, or, at the very least, for those with a taste for, and a solid grounding in, mathematics. Not only is it demanding in terms of mathematical content, it is also written in a very hard-core mathematical style. There are many theorems but little material motivating them or interpreting them in ways useful to the practitioner; indeed, there is very little to help the reader judge the significance of many of the results.
The stated aims of the book (quoting from the preface) are “to give a general overview of the theoretical foundations of learning theory…to emphasise a viewpoint that has drawn little attention …namely that of approximation theory. Second, …to attract mathematicians”. In my view, the book may not succeed at either aim, though its style seems pitched more at the latter.
I will try to further characterise what the book actually does do.
Chapter 1 puts (or begins to put) certain familiar cases into the formal setting used by the book and by approximation theorists. The cases are: regression, curve fitting, neural networks (a case where, many readers will perhaps be disappointed to find, the book says little that can easily be related to practice), classification of letters (hence multiple classes), and binary classification. The essence of the common characterisation is that there are sets X and Y and we are trying to recover a function f from X to Y. In regression and curve fitting this is an easily recognised situation (the domain and range of the function or curve being fitted). In the classification examples, Y is a discrete space (taking on only 26 or 2 distinct values, respectively, rather than, say, Y = R, the real line). One then takes samples from the Cartesian product X × Y (locations and function values; training samples and their labels) and tries to recover f by assuming something about f. This is the basic and relatively easy-to-understand setting.

An immediate complication, a point that appears subtle and is not commonly emphasised, is that one also has an associated probability measure on X × Y. This measure both governs how the (learning) samples are produced and is the measure used to characterise the deviation of the recovered function from the “true” one. We have to assume something about the regularity of f; otherwise we have no way of inferring the missing values between training samples. The restrictions placed on f are where the book rapidly becomes one demanding a certain degree of mathematical background and sophistication: the assumed regularity makes f belong to a corresponding function space (a Sobolev space, a reproducing kernel Hilbert space or RKHS, a Lipschitz space, or various restrictions of these spaces, such as compact balls in an RKHS).
The degree to which we may reasonably expect the recovered function to agree with the ground truth then depends on the number (and distribution) of samples and on the type of space. The rest of the book is, in essence, devoted to theorems characterising these dependencies.
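To fix ideas, here is my own compressed summary in more or less the book's notation (the actual statements carry more qualifications): the target is the regression function of the measure ρ, the quality of a candidate f is its expected squared error, and the excess error of an empirical minimiser over a hypothesis space H splits into exactly the two quantities that later chapters bound separately:

```latex
\[
f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad
\mathcal{E}(f) = \int_{X \times Y} \bigl(f(x) - y\bigr)^2 \, d\rho,
\]
\[
\mathcal{E}(f_{\mathbf z}) - \mathcal{E}(f_\rho)
  = \underbrace{\mathcal{E}(f_{\mathbf z}) - \mathcal{E}(f_{\mathcal H})}_{\text{sample error}}
  + \underbrace{\mathcal{E}(f_{\mathcal H}) - \mathcal{E}(f_\rho)}_{\text{approximation error}},
\]
```

where f_z minimises the empirical error over H and f_H minimises the true error over H.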
The practitioner (even one with a taste for theory and a desire for sound theoretical underpinnings) will likely find the results disappointing for several reasons. First, the derived bounds (essentially the only ‘results’) are often very complex: many preconditions must be satisfied before a theorem even applies, and the bound itself is then a complicated-looking formula with many terms and parameters. Second, readers who do get their heads around a theorem (and I suspect only a fraction will persevere to that point) face the realisation that, in a practical setting, there is essentially no way to know, or to implement a test to determine, whether the functions and sampling probabilities involved meet the preconditions. This, of course, is a problem not with this book per se but with much, if not all, of the approximation-theoretic literature of this kind.
Chapter 2 introduces some spaces from which the learnt function can be “selected”. Briefly treated are a finite-dimensional space (homogeneous polynomials) and general linear function spaces; Sobolev spaces, RKHSs, and certain subsets (compact balls) of RKHSs are then treated in more depth. The chapter concludes with a note that, in the latter case, the solution can be found by convex quadratic programming. It is interspersed with two “reminders”, one on function spaces and one on convex programming; personally, I would rather these had been moved to appendices, as they disturb the reader's attempt to follow the main thematic flow.
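To make that closing remark concrete, here is a minimal sketch of my own (not code from the book, which contains none) of selecting a function from an RKHS by regularised least squares. The representer theorem collapses the search over the infinite-dimensional space to a finite linear system in the kernel coefficients; the kernel choice and all parameter values below are illustrative assumptions.

```python
# A reviewer's sketch of learning in an RKHS: regularised least squares
# with a Gaussian kernel, solved in closed form via the representer theorem.
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gram matrix K[i, j] = exp(-|x_i - x_j|^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_rkhs(X, y, lam=1e-3, sigma=1.0):
    """Minimise (1/m) sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2 over the RKHS.
    By the representer theorem f = sum_i c_i K(x_i, .), and the coefficients
    solve the linear system (K + lam * m * I) c = y."""
    m = len(X)
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + lam * m * np.eye(m), y)
    return lambda Xnew: gaussian_kernel(Xnew, X, sigma) @ c

# Recover f(x) = sin(x) from noiseless samples on [0, 3].
X = np.linspace(0, 3, 20).reshape(-1, 1)
y = np.sin(X).ravel()
f = fit_rkhs(X, y, lam=1e-6, sigma=0.5)
print(np.max(np.abs(f(X) - y)))  # near-interpolation: training error is tiny
```

The quadratic-programming formulation the book mentions for compact balls replaces the penalty term with a norm constraint; the unconstrained penalised version above is the variant that admits this closed-form solve.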
Chapter 3 produces estimates of the probability of error (technically, of the sampling error) in terms of the number of samples and of two technical quantities attached to the assumed function space: the M-bound and the covering number. It immediately states what I deduced to be the main theorem (3.3), but does not explicitly say that the rest of the chapter is largely devoted to deriving it. I was only reasonably certain of that ten pages later, on encountering “toward the proof of theorem 3.3”, and the matter was sealed three pages after that when the proof itself was announced. The remarks at the end of the chapter hint at generalisations using loss functions.
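To give a flavour of their shape (this is the generic pattern of such results with schematic constants, not the book's precise Theorem 3.3, whose hypotheses and constants are more delicate): one covers the hypothesis space H at a scale proportional to ε and applies a concentration inequality to each member of the cover, yielding bounds of the form

```latex
\[
\mathrm{Prob}\Bigl\{ \sup_{f \in \mathcal{H}}
   \bigl|\mathcal{E}(f) - \mathcal{E}_{\mathbf z}(f)\bigr| \ge \varepsilon \Bigr\}
\;\le\; 2\, \mathcal{N}\!\Bigl(\mathcal{H}, \frac{\varepsilon}{c M}\Bigr)
   \exp\!\Bigl(-\frac{m \varepsilon^2}{C M^4}\Bigr),
\]
```

where E_z is the empirical error on the m samples z, M is the uniform bound, N(H, η) is the covering number at scale η, and c, C are constants.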
Chapter 4 derives results for the “approximation error”. One of the quite subtle things that may challenge the reader is to understand clearly the distinction between this and the sampling error treated in the previous chapter. That distinction appeared to me to be buried in a mass of definitions on pages 5-8, and I would have appreciated some more helpful exposition on how the emphasis here relates to that of chapter 3.
Chapter 5 seems to discontinue the theme of chapter 4 and return to that of chapter 3, giving some ways to estimate the covering number required to make explicit the bounds derived in chapter 3.
Chapter 6 seems to then restart with the themes/concerns of chapter 4. This order seems a bit strange. Why not place the material in chapter 5 (immediately) after chapter 3 and likewise for chapters 4 and 6?
Chapter 7 is a very short chapter, focusing on the relatively well-known bias-variance problem. This chapter, at least, depends directly on all of the preceding chapters.
Chapter 8 shifts the setting away from insisting that the functions be drawn from a compact space. The potential (and dangerous) price is that overfitting becomes likely, so a penalised (regularisation) formulation is imposed to counteract this. Once more, the majority of the chapter is devoted to proving a single theorem (8.1). I found it disappointing that, given this shift of foundation, there was no direct cross-referencing between the theorems in this chapter and the corresponding results of previous chapters under the alternative foundations; indeed, I found it nearly impossible to make my own comparisons in this regard.
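For the record, in its most common least-squares form (my paraphrase; the book treats more general cases) the penalised scheme is

```latex
\[
f_{\mathbf z, \lambda} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{H}_K}
  \;\frac{1}{m}\sum_{i=1}^{m} \bigl(f(x_i) - y_i\bigr)^2 \;+\; \lambda \|f\|_K^2,
\]
```

where the minimisation now ranges over the whole RKHS H_K rather than a compact ball, and the parameter λ > 0 trades fidelity to the data against smoothness.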
Chapter 9 looks at the special case where Y is binary and treats SVMs as a special case of such a setting.
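Concretely (a standard fact I am supplying for orientation, not a quotation from the book), the soft-margin SVM arises from the same regularisation template once the squared loss is replaced by the hinge loss:

```latex
\[
f_{\mathbf z, \lambda} = \operatorname*{arg\,min}_{f \in \mathcal{H}_K}
  \frac{1}{m}\sum_{i=1}^{m} \max\bigl(0,\, 1 - y_i f(x_i)\bigr)
  + \lambda \|f\|_K^2, \qquad y_i \in \{-1, +1\},
\]
```

with the final classifier being the sign of f_{z,λ}.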
Finally, Chapter 10 involves generalisations of the settings of chapter 9. Once again the chapter starts with a theorem (actually two, 10.2 and 10.3) stated without proof, and most of the chapter is apparently devoted to building up to the proofs. Again it is not clearly signposted that this is happening, nor where the proofs appear: about 20 pages into the chapter, we simply get unadorned statements to the effect that the proofs now follow from some just-introduced corollaries.
As I’ve indicated above, the book is hard to read. This is, not surprisingly, almost entirely because of the degree of mathematical sophistication and effort required; in the main, the prose itself is easy to read (though, as stated, it is minimal). There are a few minor typos (e.g., “a standand (sic) theme”, p. 133), only one rather odd grammatical construction that I spotted (p. 188: “The main result of this chapter, Theorem 10.24, this goal achieves for various kernels K and classifying loss functions”), and few production errors. The book seems well prepared and polished from this viewpoint. However, let me declare, should the reader not already be sufficiently aware, that I did not take the trouble to check many of the hundreds of equations and derivations! Also, as I have indicated, I don’t think the chapter ordering is the best.
I am reasonably sure I will find the book a useful reference; and, albeit through some very hard work, it has already helped me gain a better sense of certain parts of approximation-theoretic treatments of machine learning. However, it is not a book I would highly recommend.