Many qualities of this book stand out even before one begins to read it. The book is framed in a more general context than books on Support Vector Machines: it covers not only kernel pattern classification and regression (a topic treated at length in SVM books) but also kernel principal components analysis, canonical correlation analysis, ranking, clustering, multidimensional scaling, and other pattern analysis methods. Basic theoretical concepts (in Part I), pattern analysis algorithms (in Part II), and various types of kernels (in Part III) are all presented in this large unifying framework of kernel methods. The methods described are applicable to a very wide variety of application domains (e.g., bioinformatics, biometrics, data mining, document analysis, information retrieval, machine vision, text categorization, and robotics, to name just a few). Further, many of the algorithms and kernels presented in Parts II and III are accompanied by Matlab code or pseudocode, which practitioners will welcome as a bridge between understanding the theoretical framework and writing a correct implementation.
Chapter 1 introduces the basic concepts of pattern analysis: achieving data compression and prediction by finding relations (exact, approximate, or statistical) in the data. A toy example demonstrates the idea of transforming the representation of a pattern by changing the coordinate system, a concept that is central to kernel methods. Much of the terminology, many definitions, and the mathematical notation used throughout the book are introduced here, together with basic pattern recognition concepts such as generalization, over-fitting, supervised learning, and unsupervised learning. Chapter 2 provides an introduction to the general framework of kernel methods: raw data is embedded into a (usually higher-dimensional) feature space, and linear relations are learned in that feature space (often by solving an optimization problem). Only pairwise inner products between data items are needed (obtained by kernel functions), which provides computational efficiency. These concepts are illustrated using a toy linear regression example, and an overview of the rest of the book is given as a walk-through of its organization. Chapter 3 describes the theoretical properties of kernels (including some review of the linear algebra involved) in a general framework, without regard to any specific kernel type or pattern analysis algorithm. Part I concludes with Chapter 4, which develops a framework for understanding and controlling the statistical stability of kernel methods.
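The computational point described in Chapter 2, that learning in a high-dimensional feature space needs only the pairwise inner products supplied by a kernel function, can be sketched in a few lines. The following Python snippet is purely illustrative (the book's own listings are in Matlab, and these function names are mine); it checks that a degree-2 polynomial kernel reproduces the inner product of an explicit feature map without ever constructing the feature vectors:

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel: k(x, z) = (<x, z>)^degree."""
    return np.dot(x, z) ** degree

def explicit_phi(x):
    """Explicit degree-2 feature map for a 2-D input:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

# Both routes give the same number: the kernel evaluates the feature-space
# inner product implicitly, in the original 2-D input space.
k_implicit = poly_kernel(x, z)                        # (1*3 + 2*1)^2 = 25
k_explicit = np.dot(explicit_phi(x), explicit_phi(z)) # also 25
```

For a degree-d polynomial kernel on n-dimensional inputs the explicit feature space has dimension growing combinatorially in d, while the kernel evaluation stays at O(n), which is the efficiency argument the chapter makes.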
While readers whose first exposure to kernel methods is this book will find Part I useful, most practitioners could go directly to Part II, which deals with pattern analysis algorithms, and/or Part III, which deals with types of kernels. Chapter 5 prepares the reader for the remaining chapters of Part II (6, 7, and 8) by introducing basic tools (mean, distances, projections, variance, etc.) for analyzing data in a kernel-defined feature space. Chapter 6, on pattern analysis via eigen-decomposition, covers kernel Principal Components Analysis (PCA), kernel Partial Least Squares (PLS), and kernel Canonical Correlation Analysis (CCA); primal PCA, PLS, and CCA are also described briefly for reference and completeness. Kernel PCA, PLS, and CCA are fairly new techniques that have become hugely popular because they often deliver better accuracy on learning tasks than their linear counterparts (not surprising, since most real-world data cannot be described by linear relations), and kernel methods can also often deal better with very high-dimensional input data. Chapter 7 covers support vector machine classification and support vector regression algorithms, as well as novelty detectors and ridge regression. Each of these algorithms arises from casting the optimization of a desired criterion in the framework of convex optimization. The development of the simple linearly separable classification and regression cases is followed by the more complex soft margin cases for non-separable data. The authors also present on-line (incremental) versions of the batch algorithms, which many practical applications require. Interestingly, the generalization bound for the on-line algorithms is at least as good as the margin bound obtained for the hard margin SVM (although in practice, batch SVM typically generalizes better).
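To give a flavour of the eigen-decomposition methods of Chapter 6, here is a minimal kernel PCA sketch. This is an illustrative Python translation of the standard construction (centre the kernel matrix in feature space, eigen-decompose, rescale the eigenvectors), not the book's Matlab code, and the function names are mine:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca(K, n_components=2):
    """Project the training data onto the leading kernel principal components.

    K is an (n, n) kernel matrix. It is double-centred (so the implicit
    feature vectors have zero mean), eigen-decomposed, and the eigenvectors
    are scaled by 1/sqrt(eigenvalue) so each extracted direction has unit
    norm in feature space.
    """
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n  # centring in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)               # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]      # pick the largest ones
    alphas = eigvecs[:, idx] / np.sqrt(eigvals[idx])    # normalise dual coefficients
    return Kc @ alphas                                  # training-point projections

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Z = kernel_pca(rbf_kernel_matrix(X, gamma=0.5), n_components=2)
```

Note that the whole computation touches the data only through the kernel matrix, which is exactly the pattern the dual algorithms of Part II share.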
Kernel-based rank learning and kernel-based clustering algorithms are considered in Chapter 8. Batch and on-line versions of the kernel-based ranking algorithms are presented, as are kernel k-means and spectral clustering algorithms. Kernel Multi-Dimensional Scaling (MDS) is considered in a section dedicated to kernel methods for data visualization.
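To illustrate how the clustering algorithms of Chapter 8 can operate on kernel entries alone, here is a small kernel k-means sketch. It uses the standard expansion of the squared feature-space distance to a cluster centroid in terms of kernel values; the code and the `init` argument (added for reproducibility) are my illustration, not the book's algorithm listing:

```python
import numpy as np

def kernel_kmeans(K, k, init=None, n_iter=20, seed=0):
    """Lloyd-style k-means in the feature space implied by kernel matrix K.

    The squared distance from point i to the centroid of cluster c is
      ||phi(x_i) - mu_c||^2 = K[i,i] - 2*mean_j K[i,j] + mean_{j,l} K[j,l],
    with the means taken over the members j, l of cluster c, so no explicit
    feature vectors are ever formed.
    """
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = np.asarray(init) if init is not None else rng.integers(0, k, size=n)
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)          # empty clusters attract nobody
        for c in range(k):
            members = labels == c
            if not members.any():
                continue
            dist[:, c] = (diag
                          - 2 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # converged
            break
        labels = new_labels
    return labels

# Two tight groups on a line; with a linear kernel K = X X^T this
# coincides with ordinary k-means and recovers the two groups.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
labels = kernel_kmeans(X @ X.T, k=2, init=np.array([0, 1, 0, 1]))
```

Swapping the linear kernel for a nonlinear one changes the geometry of the clusters without changing a line of the algorithm, which is the appeal of the kernel formulation.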
Part III of the book consists of Chapters 9 through 12, which are devoted to the design of a wide range of kernel functions. Simple polynomial kernels, the popular infinite-dimensional Gaussian kernels, ANOVA kernels, and computational considerations are covered in Chapter 9. The description of ANOVA kernels is particularly interesting, since they are the book's first example of kernels defined by a recursive relation and efficiently evaluated using dynamic programming. Different data domains require kernels attuned to the data type; kernels from graphs, diffusion kernels on graph nodes, kernels on sets, kernels on real numbers, and randomized kernels are also covered in Chapter 9. Chapter 10 is dedicated to kernels for text data (e.g., in information retrieval applications). The discussion of the Vector Space Model (VSM), also called the ‘bag-of-words’ model (where a document is represented by the unordered set of its constituent words and the frequency of their occurrence), is interesting, since the vector representation can have tens of thousands of dimensions (often more than the number of training examples) yet is extremely sparse (only a few entries are non-zero). The chapter discusses vector space kernels as well as refinements that take semantics into account. Kernels for structured data such as strings and trees (e.g., in applications such as bioinformatics) are covered in Chapter 11. A kernel that sums the number of common substrings/subsequences can be used to compare strings and can be implemented efficiently using dynamic programming; kernels for trees are similar, as both exploit the recursive structure of the data. Chapter 12 examines how the probability of co-occurrence or the Fisher kernel construction can be used to create kernels from generative models of data.
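The substring-counting idea behind the string kernels of Chapter 11 is easy to sketch. The following Python function (an illustrative version of the general idea, not the book's exact kernel) counts all pairs of identical contiguous substrings of two strings with a quadratic dynamic program over common-suffix lengths:

```python
def common_substring_kernel(s, t):
    """Count pairs of identical non-empty contiguous substrings of s and t.

    C[i][j] holds the length of the longest common suffix of s[:i] and t[:j].
    A common suffix of length L at (i, j) means L matching substrings end
    there, so summing C over all (i, j) counts every pair of equal
    substrings. Runs in O(|s| * |t|) time by dynamic programming.
    """
    m, n = len(s), len(t)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    total = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                C[i][j] = C[i - 1][j - 1] + 1  # extend the common suffix
                total += C[i][j]
    return total

# "ab" vs "ab" shares "a", "b", and "ab": kernel value 3.
```

A naive enumeration of all substring pairs would be quartic in the string lengths; the recursion over suffix lengths is what brings this down to quadratic, mirroring the recursive-plus-dynamic-programming pattern the chapter uses for its string and tree kernels.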
The fixed-length Hidden Markov Model (HMM) kernel, the pair HMM kernel, the hidden tree model kernel, and the fixed-length Markov model Fisher kernel are discussed here.
Overall I enjoyed reading this book and am happy to add it to my library as a valuable practitioner’s reference. I especially liked the presentation of the kernel-based pattern analysis algorithms in terse mathematical steps that clearly identify the input data, the output data, and the steps of the process. The accompanying Matlab code or pseudocode is also extremely useful. I would have liked to see more toy examples with figures depicting not only the results of straightforward implementations of the kernel pattern analysis algorithms but also the effects of different kernels/parameters, over-fitting, and regularization. A section giving the authors’ expert suggestions on the choice of kernel type, kernel parameters, learning algorithm parameters, and data normalization would also have been welcome.

Kernel Methods for Pattern Analysis
By John Shawe-Taylor and Nello Cristianini
Cambridge University Press, 2004

Reviewed by: Salil Prabhakar, Digital Persona Inc.

Book Reviews Published in the IAPR Newsletter
Machine Vision Books, Jul. ‘05
CVonline: an overview, Apr. ‘05
The Guide to Biometrics by Bolle, et al Jan. ‘05
Pattern Recognition Books, Jul. ‘04