Learning Theory:

An Approximation Theory Viewpoint


by Felipe Cucker, Ding Xuan Zhou

Cambridge Monographs on Applied & Computational Mathematics

Cambridge University Press; May 2007


Reviewed by: David Suter (Australia)



I was looking forward to reviewing this book because I have long held the view that the “right way” to look at machine learning is the approximation theoretic one. Thus the title was certainly appealing. However, what I was looking for was a book that would convince the community at large that such was the way they should be thinking of machine learning. This is not that sort of book, unfortunately. However, for specialists, it will still be a useful addition to their bookshelves.

This is a book for mathematicians, or, at the very least, for those with a taste for, and a solid grounding in, mathematics. Not only is it demanding in its mathematical content, but it is also written in a very hard-core mathematical style. There are many theorems, but little material to motivate them, to interpret them in ways useful to the practitioner, or to help the reader judge their significance.

The stated aims of the book (quoting from the preface) are "to give a general overview of the theoretical foundations of learning theory… to emphasise a viewpoint that has drawn little attention… namely that of approximation theory. Second, … to attract mathematicians". In my view, the book may not fully succeed at either aim, but its style seems more directed at the latter.

I will try to further characterise what the book actually does do.

Chapter 1 puts (or begins to put) certain familiar cases into the formal setting used by the book and by approximation theorists. The cases are: regression, curve fitting, neural networks (a case where many readers will perhaps be disappointed to find that the book says little they can easily relate to), classification of letters (hence multiple classes), and binary classification. The essence of the common characterisation is that there is a set X and a set Y, and we are trying to recover a function f from X to Y. In regression and curve fitting, this is an easily recognised situation (the domain and range of the function/curve being fitted). In the classification examples, Y is a discrete space (taking on only 26 or 2 distinct values, respectively, rather than, say, Y = R, the real line). One then takes samples from the Cartesian product X × Y (locations and function values; training samples and their labels) and tries to recover f by assuming something about f. This is the basic and relatively easy-to-understand setting.

An immediate complication, a point that is subtle and not commonly emphasised, is that there is also an associated probability measure on X × Y. This measure both governs how the (learning) samples are produced and is the measure used to characterise the deviation of the recovered function from the "true" one. We must assume something about the regularity of f; otherwise we have no way of inferring the missing values between training samples. The restrictions placed on f are where the book rapidly becomes one that demands a certain degree of mathematical background and sophistication: the assumed regularity makes f belong to a corresponding function space (a Sobolev space, a reproducing kernel Hilbert space or RKHS, a Lipschitz space, or various restrictions of these spaces, such as compact balls in an RKHS).
The degree to which we may reasonably expect the recovered function to agree with the ground truth then depends on the number (and distribution) of samples and on the type of space. The rest of the book is, in essence, a collection of theorems characterising these dependencies.
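In symbols (my own gloss in the standard least-squares notation of this literature, not a quotation from the book): given the probability measure ρ on X × Y, the quality of a candidate f is its expected error, and the ideal target is the regression function,

```latex
\mathcal{E}(f) \;=\; \int_{X \times Y} \bigl(f(x) - y\bigr)^2 \, d\rho ,
\qquad
f_\rho(x) \;=\; \int_Y y \, d\rho(y \mid x) ,
```

where f_ρ minimises ℰ. Given a sample z = ((x_1, y_1), …, (x_m, y_m)) drawn according to ρ, one can only minimise the empirical error ℰ_z(f) = (1/m) Σᵢ (f(xᵢ) − yᵢ)² over the chosen function space; the theorems then bound how far this takes us from f_ρ.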

The practitioner (even one with a taste for theory and a desire for sound theoretical underpinnings) will likely find the results disappointing for several reasons. First, the derived bounds (essentially the only results on offer) are often very complex: many preconditions must be satisfied before a theorem even applies, and the theorem itself is then a complicated-looking formula with many terms and parameters. Second, if and when readers get their heads around a theorem (and I suspect only a fraction will persevere to that point), they face the realisation that, in a practical setting, there is essentially no way to know, or to implement a test to determine, whether the functions and sampling probabilities involved meet the preconditions. This, of course, is a problem not with this book per se but with much, if not all, of this kind of approximation-theoretic setting and its results.

Chapter 2 introduces some spaces from which the learnt function can be "selected". Briefly treated are one finite-dimensional space (homogeneous polynomials) and general linear function spaces; then come more in-depth treatments of Sobolev spaces, RKHSs, and certain subsets (compact balls) of RKHSs. The chapter concludes with a note that, in the latter case, the solution can be found by convex quadratic programming. It is interspersed with two "reminders", one on function spaces and one on convex programming; personally, I would rather these had been moved to appendices, as they disturb the flow when the reader tries to follow the main message.
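To make the RKHS optimisation concrete: in the penalised (rather than ball-constrained) variant, and for squared loss, the quadratic program collapses to a linear system by the representer theorem. A minimal numerical sketch (my own illustration with an assumed Gaussian kernel, not code from the book):

```python
# Illustrative sketch: regularized least squares in an RKHS with a Gaussian
# kernel. By the representer theorem, the minimizer of
#     (1/m) * sum_i (f(x_i) - y_i)^2  +  lam * ||f||_K^2
# over the RKHS has the form f(x) = sum_i a_i K(x, x_i), where the
# coefficient vector a solves the linear system (K + lam*m*I) a = y.
import numpy as np

def gaussian_kernel(x, xp, sigma=0.3):
    # Gram matrix K[i, j] = exp(-(x_i - xp_j)^2 / (2 sigma^2))
    return np.exp(-(x[:, None] - xp[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
m = 40
x = np.sort(rng.uniform(0, 1, m))                        # sample locations
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(m) # noisy labels

lam = 1e-3
K = gaussian_kernel(x, x)
a = np.linalg.solve(K + lam * m * np.eye(m), y)          # expansion coefficients

def f(t):
    # The recovered function, evaluated at new points t
    return gaussian_kernel(np.atleast_1d(t), x) @ a

print(float(np.mean((f(x) - y) ** 2)))                   # empirical error
```

The kernel choice, its width, and λ are all assumptions for illustration; the book's ball-constrained version replaces the penalty with the constraint ‖f‖_K ≤ R, which is what makes it a quadratic program proper.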

Chapter 3 produces estimates of the probability of error (technically, the sample error) in terms of the number of samples and of two technical quantities attached to the assumed function space: the M-bound and the covering number. It immediately states what is, I deduced, the main theorem (3.3), but does not explicitly say that the rest of the chapter is largely devoted to deriving it. I was only reasonably certain of that ten pages in, when I encountered "toward the proof of Theorem 3.3", confirmed three pages later when the proof was finally announced. The remarks at the end of the chapter hint at generalisations using loss functions.
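To give the flavour of such bounds (this is the generic Hoeffding inequality, not Theorem 3.3 itself): if the loss (f(x) − y)² of a single fixed f lies in [0, M²], then for a sample of size m

```latex
\operatorname{Prob}_z \Bigl\{ \bigl| \mathcal{E}(f) - \mathcal{E}_z(f) \bigr| \ge \varepsilon \Bigr\}
\;\le\; 2 \exp\!\Bigl( -\frac{2 m \varepsilon^2}{M^4} \Bigr) .
```

The covering number enters when this pointwise statement must hold uniformly over the whole hypothesis space: roughly, the right-hand side gets multiplied by the number of balls of radius η needed to cover the space, which is where the complexity of the function space shows up in the final bound.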

Chapter 4 derives results for the "approximation error" (one of the subtler challenges for the reader is to clearly understand the distinction between this and the sample error treated in the previous chapter). The distinction appeared to me to be buried in a mass of definitions spanning pages 5-8, and I would have appreciated more exposition on how the emphasis here relates to that of chapter 3.

Chapter 5 seems to discontinue the theme of chapter 4 and return to that of chapter 3, giving some ways to estimate the covering number required to make explicit the bounds derived in chapter 3.

Chapter 6 then seems to restart with the themes and concerns of chapter 4. This ordering seems a little strange: why not place the material of chapter 5 immediately after chapter 3, and likewise chapter 6 immediately after chapter 4?

Chapter 7 is a very short chapter focusing on the relatively well-known bias-variance problem. This chapter, at least, depends directly on all of the preceding chapters.
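The trade-off in question is, in generic form (stated here in the terminology the review has been using, not as a quotation), the split of the total excess error of the learnt function f_z:

```latex
\mathcal{E}(f_z) - \mathcal{E}(f_\rho)
\;=\;
\underbrace{\mathcal{E}(f_z) - \mathcal{E}(f_{\mathcal{H}})}_{\text{sample error}}
\;+\;
\underbrace{\mathcal{E}(f_{\mathcal{H}}) - \mathcal{E}(f_\rho)}_{\text{approximation error}} ,
```

where f_𝓗 minimises the expected error over the hypothesis space 𝓗. Enlarging 𝓗 shrinks the approximation error (chapters 4 and 6) but inflates the sample error (chapters 3 and 5), which is why the chapter genuinely needs everything that precedes it.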

Chapter 8 shifts the setting away from insisting that the functions be drawn from a compact space. The potential (and dangerous) price is that overfitting becomes likely, so a penalised (regularised) formulation is imposed to counteract it. Once more, the majority of the chapter is devoted to proving a single theorem (8.1). I found it disappointing that, given this shift of foundations, there was no direct cross-referencing between the theorems of this chapter and the corresponding results of previous chapters (with the alternative foundations); indeed, I found it practically impossible to make my own comparisons in this regard.
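For reference, the penalised formulation in question is a standard Tikhonov regularisation in the RKHS norm (stated generically here):

```latex
f_{z,\lambda} \;=\; \arg\min_{f \in \mathcal{H}_K} \;
\frac{1}{m} \sum_{i=1}^{m} \bigl( f(x_i) - y_i \bigr)^2 \;+\; \lambda \, \| f \|_K^2 ,
```

with λ > 0 trading empirical fit against smoothness: the norm penalty replaces the compact-ball constraint of the earlier chapters, which is precisely the shift of foundation being discussed.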

Chapter 9 looks at the special case where Y is binary and treats SVMs as a special case of such a setting.
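The connection is the standard one (my generic statement, not the book's exact formulation): with Y = {−1, +1} and the hinge loss, the soft-margin SVM is the regularised scheme

```latex
f_{z,\lambda} \;=\; \arg\min_{f \in \mathcal{H}_K} \;
\frac{1}{m} \sum_{i=1}^{m} \phi\bigl( y_i \, f(x_i) \bigr) \;+\; \lambda \, \| f \|_K^2 ,
\qquad
\phi(t) = \max(0,\, 1 - t) ,
```

with the classifier taken to be sign(f_{z,λ}); the hinge loss φ plays the role the squared loss played in the regression setting.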

Finally, chapter 10 generalises the settings of chapter 9. Once again the chapter starts with theorems (actually two, 10.2 and 10.3) stated without proof, and most of the chapter is apparently devoted to building up to their proofs. Again, it is not clearly signposted that this is happening or where the proofs are completed: about 20 pages into the chapter, we simply find unadorned statements to the effect that the proofs now follow from some just-introduced corollaries.

As I have indicated above, the book is hard to read, and this is almost entirely because of the degree of mathematical sophistication and effort it demands; the prose itself, though minimal, is easy to read. There are a few minor typos (e.g., "a standand (sic) theme", p. 133), only one rather odd grammatical construction that I spotted (p. 188: "The main result of this chapter, Theorem 10.24, this goal achieves for various kernels K and classifying loss functions"), and few production errors. From this viewpoint the book seems well prepared and polished. However, let me declare, should the reader not already be sufficiently aware, that I did not take the trouble to check many of the hundreds of equations and derivations! And, as I have indicated, I do not think the main chapter ordering is the best.

I am reasonably sure I will find the book a useful reference; and, albeit through some very hard work, it has already helped me gain a better sense of certain parts of the approximation-theoretic treatment of machine learning. However, it is not a book I would highly recommend.
