Introduction to clustering large and high-dimensional data


by Jacob Kogan

Cambridge University Press, 2007


Reviewed by:  Nicolas Loménie


Although the book is entitled Introduction to clustering large and high-dimensional data, it focuses on the k-means numerical scheme and text mining applications. At first glance, one might consider it a challenge to write an interesting 200-page book with 149 references on such a narrow subject as the k-means algorithm. However, in the course of my research, I have come to apply the k-means scheme to stereoscopic data for visual computing in many more ways than those generally accepted in the computer science community. Arguing with colleagues (experts in data mining and classification), I have claimed that the k-means scheme is too often reduced to its basic, primal formulation as a quadratic distance-based algorithm used to discover structures. To me, the k-means scheme is much more general and subtle. And that is exactly the topic of this book.


This book consists of 11 chapters. Each chapter ends with thorough bibliographic notes and references. Chapter 1 introduces the topic of the book: clustering of sparse data in high-dimensional space, especially for document retrieval. Chapter 2 deals with the classic formulation of the quadratic k-means algorithm in Euclidean spaces. Chapter 3 is a brief chapter dedicated to the BIRCH algorithm, which handles large amounts of data when the available memory is limited. Chapter 4 deals with the spherical k-means algorithm, an adaptation of the k-means scheme to a particular space (a hypersphere) embedded in the Euclidean space and usually adopted in document retrieval applications. Chapters 5 to 8 broaden the classic quadratic k-means scheme to various formulations, demonstrating that this numerical scheme has broader applicability than is usually depicted in lectures or even research papers. Chapter 9 moves on to the assessment of clustering results. Finally, Chapters 10 and 11 serve as a useful appendix, covering background in optimization and linear algebra and solutions to selected problems and exercises raised throughout the preceding chapters.
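To make the contrast between the classic quadratic formulation (Chapter 2) and the spherical variant (Chapter 4) concrete, here is a minimal sketch in Python. It is not taken from the book: the function name `kmeans`, its parameters, and the helper functions are illustrative assumptions, and the sketch uses a plain batch iteration with user-supplied initial centroids.

```python
import math

def _normalize(v):
    """Scale v to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def _sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def kmeans(points, init_centroids, spherical=False, max_iter=100):
    """Batch k-means sketch (hypothetical helper, not the book's code).

    spherical=False: assign each point to the centroid at the smallest
                     squared Euclidean distance; centroid = cluster mean.
    spherical=True:  project points onto the unit hypersphere, assign by
                     largest cosine similarity; centroid = normalized mean.
    Returns (labels, centroids).
    """
    prep = _normalize if spherical else list
    points = [prep(p) for p in points]
    centroids = [prep(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: pick the best centroid for every point.
        if spherical:
            new_labels = [max(range(k), key=lambda j: _dot(p, centroids[j]))
                          for p in points]
        else:
            new_labels = [min(range(k), key=lambda j: _sq_dist(p, centroids[j]))
                          for p in points]
        if new_labels == labels:
            break  # the partition is stable, so the centroids are too
        labels = new_labels
        # Update step: recompute each centroid from its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue  # keep the old centroid for an empty cluster
            mean = [sum(col) / len(members) for col in zip(*members)]
            centroids[j] = _normalize(mean) if spherical else mean
    return labels, centroids

labels, centroids = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)],
                           [(0, 0), (10, 10)])
s_labels, s_centroids = kmeans([(1, 0), (2, 0.1), (0, 3), (0.1, 5)],
                               [(1, 0), (0, 1)], spherical=True)
```

Only the assignment criterion and the centroid update differ between the two modes, which is one way to see the k-means scheme as a family of algorithms rather than a single quadratic one.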


The author is a professor in the Department of Mathematics at the University of Maryland, Baltimore County. Accordingly, the book is a formal treatment of the topic with numerous definitions, theorems and lemmas. It also provides many numerical experiments and discussions with simple examples to clarify the behavior of this stimulating scheme. Hence, this book may serve as a useful reference for scientists and engineers who need to understand the concepts of clustering in general or to focus on text mining applications. It is also appropriate for students who are attending a course in pattern recognition, data mining, or classification and are interested in learning more about issues related to the k-means scheme for an undergraduate or master's thesis project. Lastly, it supplies very interesting material for instructors.


To improve the second edition, I would suggest:

• giving many more ready-to-implement algorithms in pseudo-code, or at least making the existing ones more visible in the text;

• providing many more references to the pattern recognition and computer science communities, which have been facing these issues as well, for example the book by J. Bezdek et al. on fuzzy clustering.


From a general point of view, it is interesting to note that, even in such a narrow scientific area, the community does not use a common vocabulary; for instance, the term "fuzzy" hardly appears even once in this book!


Book Reviews Published in the IAPR Newsletter


Dynamic Vision for Perception and Control of Motion

by Dickmanns

             (see review in this issue)



by Polanski and Kimmel

             (see review in this issue)



The Text Mining Handbook

by Feldman and Sanger

             (see review in this issue)


Information Theory, Inference, and Learning Algorithms

by MacKay

                 (see review in this issue)


Geometric Tomography

by Gardner

           Oct ‘07   [html]     [pdf]


“Foundations and Trends in Computer Graphics and Vision”

Curless, Van Gool, and Szeliski, Editors

           Oct ‘07   [html]     [pdf]


Applied Combinatorics on Words

by M. Lothaire

           Jul ‘07    [html]     [pdf]



Human Identification Based on Gait

by Nixon, Tan and Chellappa

             Apr ‘07   [html]     [pdf]


Mathematics of Digital Images

by Stuart Hoggar

             Apr ‘07   [html]     [pdf]


Advances in Image and Video Segmentation

Zhang, Editor

             Jan ‘07 [html]      [pdf]


Graph-Theoretic Techniques for Web Content Mining

by Schenker, Bunke, Last and Kandel

             Jan ‘07 [html]      [pdf]


Handbook of Mathematical Models in Computer Vision

by Paragios, Chen, and Faugeras (Editors)

           Oct ‘06     [html]     [pdf]


The Geometry of Information Retrieval

by van Rijsbergen

           Oct ‘06     [html]     [pdf]


Biometric Inverse Problems

by Yanushkevich, Stoica, Shmerko and Popel

           Oct ‘06     [html]     [pdf]


Correlation Pattern Recognition

by Kumar, Mahalanobis, and Juday

           Jul. ‘06     [html]     [pdf]


Pattern Recognition 3rd Edition

by Theodoridis and Koutroumbas

           Apr. ‘06    [html]     [pdf]


Dictionary of Computer Vision and Image Processing

by R.B. Fisher, et al.

           Jan. ‘06    [html]     [pdf]


Kernel Methods for Pattern Analysis

by Shawe-Taylor and Cristianini

           Oct. ‘05    [html]     [pdf]


Machine Vision Books

           Jul. ‘05     [html]     [pdf]


CVonline:  an overview

           Apr. ‘05    [html]     [pdf]


The Guide to Biometrics by Bolle, et al

           Jan. ‘05    [html]     [pdf]


Pattern Recognition Books

           Jul. ‘04                  [pdf]