Talk 1

Dr. Jain warmed up his audience with an interesting example of data clustering in archeology (clustering of Apsara faces at the Angkor Wat temple), showing therefore that data categorization is a fundamental problem shared by many scientific fields.

The talk continued with an elegant statement of the data clustering problem, which is an unsupervised classification method for grouping objects into meaningful categories. Mathematically speaking, a clustering problem typically receives as input representations of N objects, and generates K clusters based on a measure of similarity.

Clustering started as a natural classification problem in biology, where the degree of similarity among forms was used to form taxonomies and phylogenetic relationships. Today, there are many more clustering-related problems, including data exploration and data compression in any scientific field that data. The exploratory nature of cluster analysis is well-suited to the data explosion phenomenon that we witness today. To manage an ever-expanding amount of data, one needs tools that reveal the underlying structure of data, generate hypotheses for future data trends, and detect anomalies. Clustering algorithms are at the core of these tools.

Dr. Jain’s talk discussed the milestones in the history of clustering algorithms. It was interesting to learn that ‘cluster analysis’ is a term coined by a 1954 article analyzing anthropological data. Another historical detail that intrigued the audience was the four independent discoveries of the K-Means algorithm (Steinhaus in 1956, Lloyd in 1957, Cox in 1957, Ball & Hall in 1967, and MacQueen in 1967). The basics of the K-means algorithm were briefly discussed, as well as some of its versions (Bisecting K-means by Karypis et al., X-means by Pelleg and Moore, Constrained K-means by Davidson, and Scalable K-means by Bradley et al.). Dr. Jain also presented the main paradigms in data clustering that go beyond K-means, such as Bayesian models, kernel methods, association rules (subspace clustering), graph mining, and large scale clustering.

One of the most interesting parts of Dr. Jain’s talk covered the ‘user dilemmas’ in data clustering. What is a cluster? What data representation paradigm is best suited? How many clusters? Which clustering method? Are the discovered clusters & partition valid? etc. Although some of these dilemmas were described by Dubes & Jain in a 1976 issue of Pattern Recognition, they remain strikingly actual and thus worth being revisited.

The final part of Dr. Jain’s talk addressed the new trends in clustering-related work, such as clustering of large-scale data, evidence accumulation by combining multiple partitions, multi-way clustering, and clustering complex data types.