The World Wide Web contains a massive amount of unstructured information.
Web content mining involves processing these web pages through clustering (locating documents that are similar to each other) and classification (assigning documents to pre-determined classes). These methods can be useful for both visualizing and browsing through the structure of a collection of web documents. For example, one commercial application that uses this approach can be found at www.leximancer.com. In this book, Adam Schenker et al. describe how to apply methods from graph theory to web content mining. The advantage of this graph-based approach is that information that may be lost in the traditional vector space model (VSM) approach can be included (such as term order and document section).
Graph-Theoretic Techniques for Web Content Mining is a reasonably short book (235 pages) consisting of seven main chapters followed by 70 pages of appendices. The book appears to be based on the first author’s PhD thesis.
The first chapter provides a brief review of previous web mining research and an overview of the traditional information retrieval vector space model.
The second chapter is a carefully written presentation of graph similarity techniques. These techniques are used through the remainder of the book and include: graph and subgraph isomorphism; graph edit distance; state space search, probabilistic and distance approaches; and locating graph mean and medians.
Chapter three is quite short (9 pages). However, it discusses a number of different methods for representing web sites as graphs. Firstly, pre-processing is described (stop-lists and stemming), along with a brief section on latent semantic indexing. These are followed by a section on different types of representation (such as n-distance and absolute distance). Because calculating the distance between graphs is somewhat complicated, a short section on Schenker et al.’s method and some associated complexity analysis is included.
The focus of the fourth chapter is the author’s extension of the popular k-means clustering algorithm to graphs. This graph-based algorithm performs well compared to the other method, especially when a large number of nodes are included. As the authors point out, one advantage of a graph-based approach is that when a new term is added only the new node and edges need to be added to the particular graph. In contrast, a VSM approach would require an increase in dimensionality (by one) for every document in the collection.
Chapter five presents a graph-based extension to the k-nearest neighbors classification algorithm. Experimental results are given comparing the traditional VSM to the new approach. The results indicate that the graph-based method can be accurate and take approximately the same amount of classification time.
Chapter six describes a web based clustering system which accepts a query, extracts relevant web sites from a standard search engine (Google and Alta Vista were used in this book), and then organizes the resulting web documents into hierarchies of concepts from the documents. These cluster hierarchies attempt to represent the knowledge from these documents. The authors’ system is compared (favorably) with the Grouper and Vivisimo systems. Advantages of this approach are that it enables users to focus on specific areas of interest (which could speed up web searching), and also suggests how a search query topic relates to other topics. This was a well written chapter, with clear examples and results.
The final chapter is a short three pages and presents conclusions and some ideas for future work. There seems to be plenty of scope to develop the ideas from this book, for example by processing information from tags in html documents. I think it would be interesting to extend the work to other types of web content, such as images and music.
Two appendices are included. The first, which is 60 pages in length, presents examples of creating graphs from web sites. The rendered html, html source, and some graph portions are provided for three different web documents. These examples are interesting and clearly demonstrate the sometimes complex relationships between document terms. However, these examples are based on ‘live’ URLs (rather than the F-, J-, and K-series datasets used throughout the other chapters). A quick check found that two of the three web sites have been updated since this book was published.
The second appendix is a list of stop words used. A minor complaint here is that it is not clear why the authors have chosen these words rather than use a standard stop list (such as the SMART list).
In summary, I found this an interesting book, which includes a number of contributions to the web content mining and avenues for future work. The algorithms and math are presented clearly and thoughtful examples are usually provided. Some further editing would have been beneficial. For example, a the chapters use a different reference style (e.g., [Lov68]) from the bibliography (which uses the familiar IEEE style) making it difficult to locate references. More importantly some references mentioned in the text aren’t listed in the references at all (for example, [SLK01a]). I also thought that it would have been useful if the book had an associated web site, for example to test out the web-based clustering system from Chapter six, or even to download the html and stop list from the appendices.
This book could be of interest to those conducting IR, Artificial Intelligence (particularly machine learning), or cognitive science research. Web search engine developers would also be interested in the type of knowledge representation possible from a graph-based approach.
In conclusion, despite the thesis-like style and lack of editing, I found this an enjoyable book that raised a number of original and interesting ideas.
Graph-Theoretic Techniques for Web Content Mining
Adam Schenker, Horst Bunke, Mark Last and Abraham Kandel
World Scientific, 2005
Reviewed by: Jason Dowling
Click above to go to the World Scientific Bookshop web page for this book where you will be able to see the Table of Contents, Preface and an excerpt from Chapter 1.
Book Reviews Published in
the IAPR Newsletter
Advances in Image and Video Segmentation
Handbook of Mathematical Models in Computer Vision
by Paragios, Chen, and Faugeras (Editors)
The Geometry of Information Retrieval
by van Rijsbergen
Biometric Inverse Problems
by Yanushkevich, Stoica, Shmerko and Popel
Correlation Pattern Recognition
by Kumar, Mahalanobis, and Juday
Pattern Recognition 3rd Edition
by Theodoridis and Koutroumbas
Dictionary of Computer Vision and
by R.B. Fisher, et. Al
Kernel Methods for Pattern Analysis
by Shawe-Taylor and Cristianini
Machine Vision Books
CVonline: an overview
The Guide to Biometrics by Bolle, et al
Pattern Recognition Books
Jul. ‘04 [pdf]