Introduction to clustering large and high-dimensional data

by Jacob Kogan

Cambridge University Press, 2007

Reviewed by: Nicolas Loménie

Although the book is entitled Introduction to clustering large and high-dimensional data, it focuses on the k-means numerical scheme and text mining applications. At first glance, one might consider it a challenge to write an interesting 200-page book with 149 references on such a narrow subject as the k-means algorithm. However, in the course of my research, I have come to apply the k-means scheme to stereoscopic data for visual computing in many more ways than those generally accepted in the computer science community. In discussions with colleagues who are experts in data mining and classification, I have argued that the k-means scheme is too often reduced to its basic, primal formulation as a quadratic distance-based algorithm used to discover structures. To me, the k-means scheme is much more general and subtle. And that is exactly the topic of this book.

This book consists of 11 chapters. Each chapter ends with thorough bibliographic notes and references. Chapter 1 introduces the topic of the book: clustering of sparse data in high-dimensional space, especially for document retrieval. Chapter 2 deals with the classic formulation of the quadratic k-means algorithm in Euclidean spaces. Chapter 3 is a brief chapter dedicated to the BIRCH algorithm, which handles large amounts of data when the available memory is limited. Chapter 4 deals with the spherical k-means algorithm, an adaptation of the k-means scheme to the unit hypersphere embedded in Euclidean space, which is commonly adopted in document retrieval applications. Chapters 5 to 8 extend the classic quadratic k-means scheme to various formulations, demonstrating that this numerical scheme has broader applicability than lectures or even research papers usually suggest. Chapter 9 moves on to the assessment of clustering results. Finally, Chapters 10 and 11 form an interesting appendix covering optimization and linear algebra background and solutions to selected problems and exercises from the preceding chapters.
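To make the contrast between the classic quadratic scheme (Chapter 2) and its spherical variant (Chapter 4) concrete, here is a minimal Python sketch. It is not taken from the book: the function name `kmeans`, its parameters, the random initialization, and the handling of empty clusters are all illustrative assumptions. The only point is that the same assign-then-update loop can serve both formulations once the dissimilarity measure and the centroid update are swapped.

```python
import math
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def kmeans(points, k, spherical=False, iters=100, seed=0):
    """Hypothetical batch k-means sketch (not the book's pseudo-code).

    spherical=False: assign by smallest squared Euclidean distance,
                     centroid = arithmetic mean of the cluster.
    spherical=True:  data are projected onto the unit sphere, assign by
                     largest cosine similarity, centroid = normalized mean.
    """
    rng = random.Random(seed)
    if spherical:
        points = [_normalize(p) for p in points]
    # Assumption: initial centroids are k data points drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its best-matching centroid.
        if spherical:
            new_labels = [max(range(k), key=lambda j: _dot(p, centroids[j]))
                          for p in points]
        else:
            new_labels = [min(range(k), key=lambda j: _sq_dist(p, centroids[j]))
                          for p in points]
        if new_labels == labels:
            break  # partition is stable, so the centroids will not move
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue  # assumption: an empty cluster keeps its old centroid
            mean = [sum(col) / len(members) for col in zip(*members)]
            centroids[j] = _normalize(mean) if spherical else mean
    return labels, centroids
```

Calling `kmeans(vectors, 2)` runs the quadratic version, while `kmeans(vectors, 2, spherical=True)` groups vectors by direction only, which is why the spherical form suits document vectors of very different lengths. Like any batch k-means, this sketch only reaches a local optimum that depends on the initial centroids.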

The author is a professor in the department of Mathematics at the University of Maryland, Baltimore. Accordingly, the book is a formal treatment of the topic with numerous definitions, theorems and lemmas. It also provides many numerical experiments and discussions of simple examples that clarify the behavior of this stimulating scheme. Hence, this book may serve as a useful reference for scientists and engineers who need to understand the concepts of clustering in general or to focus on text mining applications. It is also appropriate for students who are attending a course in pattern recognition, data mining, or classification and are interested in learning more about issues related to the k-means scheme for an undergraduate or master's thesis project. Finally, it supplies very interesting material for instructors.

To improve the second edition, I would suggest:

• providing many more ready-to-implement algorithms in pseudo-code, or at least making the existing ones more visible in the text;

• providing many more references to work from the pattern recognition and computer science communities, which have been facing these issues as well: the book by J. Bezdek et al. on fuzzy clustering, for example.

From a general point of view, it is interesting to note that, even in such a narrow scientific area, the community does not share a common vocabulary; for instance, the term "fuzzy" hardly appears even once in this book!

Book Reviews Published in the IAPR Newsletter

Dynamic Vision for Perception and Control of Motion

by Dickmanns

(see review in this issue)

Bioinformatics

by Polanski and Kimmel

(see review in this issue)

The Text Mining Handbook

by Feldman and Sanger

(see review in this issue)

Information Theory, Inference,