The Text Mining Handbook
by Ronen Feldman and James Sanger
Cambridge University Press, 2007
Reviewed by L. Venkata Subramaniam
Text mining today covers a broad range of topics. This handbook gives a high-level perspective of text mining by covering many of the important topics. The handbook is aimed at a wide spectrum of audiences comprising students, academic researchers and professional practitioners.
The first two chapters of the book provide an introduction to text mining and the operations involved in doing text mining. Chapter I presents text mining definitions. It also gives the general architecture of a text mining system. Chapter II presents core text mining operations. This chapter covers various pattern-discovery algorithms.
The next six chapters present basic preprocessing techniques in text mining. Chapter III presents an extremely brief introduction to linguistic preprocessing techniques in text mining. Chapter IV covers text categorization. Chapter V looks at text clustering. Chapter VI covers information extraction (IE). These chapters cover the main definitions and techniques. Chapter VII covers probabilistic models for information extraction. Chapter VIII presents the applications of the probabilistic models presented in the previous chapter to different IE tasks. In particular, hidden Markov models, stochastic context-free grammars, and maximum entropy models are covered from the mathematical perspective, and their application to IE is given in these two chapters.
The next two chapters cover the user interface part of text mining systems. Chapter IX looks at aspects related to browsing large text collections. Chapter X covers visualization approaches to view the text document collections and the results obtained from various text mining operations on document collections.
Chapter XI covers link analysis and presents techniques for analyzing large networks of entities. Whereas the first eight chapters discussed how entities can be extracted from text, this chapter focuses on finding specific patterns within the network of entities.
Finally, in Chapter XII, real-world applications are presented. Text mining systems in the areas of corporate finance, patent research, and life sciences are presented.
The Appendix explains DIAL (declarative information analysis language). This is a dedicated information extraction language.
There are notes at the end of each chapter that discuss related work. These notes are very helpful for placing the work of the chapter in context and for looking up related work to gain a better understanding. There is a common bibliography at the end of the book.
One topic that I think the authors should have covered, but didn’t cover at all, is text mining in the presence of noise. Real-world user-generated text data is noisy, and today it is important to deal with it. Blogs, newsgroup postings, emails and other such spontaneously written texts, found in abundance, are very noisy. Further, there is also deliberately added noise in the form of spam and splogs. From my perspective as a text mining practitioner, I would have liked to see some coverage of this. But that is something for the authors to add in the next edition.
In their preface, the authors mention that they have tried to blend theory and practice by providing many real-life scenarios that show how the different techniques are used in practice. I think they have largely succeeded in doing that. They have addressed the needs of both developers and users of text mining systems.
My recommendation to readers is to buy the book. It is definitely worth having on your bookshelf as a handy reference.