Workshop Report: AND 2008

2nd Workshop on Analytics for Noisy Unstructured Text Data
24 July 2008, Singapore

Report prepared by Shourya Roy (India)

Noise in text can be defined as any kind of difference between the surface form of a coded representation of the text and the intended, correct, or original text. By its very nature, noisy text warrants moving beyond traditional text analytics techniques: noise introduces challenges that need special handling, either through new methods or improved versions of existing ones. To raise and address some of these issues, the 2nd Workshop on Analytics for Noisy Unstructured Text Data (AND-II) was organized as part of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008) and held on 24 July 2008 in Singapore. The inaugural edition of the workshop took place in January 2007 in Hyderabad, India, in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI); a report on AND 2007 appeared in the April 2007 issue of the IAPR Newsletter.

Like the first edition, AND-II was very successful; it was attended by over 40 researchers from academic institutions and companies around the world. The workshop was chaired by Daniel Lopresti (Associate Professor, Lehigh University), Shourya Roy (Technical Staff Member, IBM Research, India Research Lab), Klaus U. Schulz (Professor, University of Munich), and L. Venkata Subramaniam (Research Staff Member, IBM Research, India Research Lab).

The workshop began with a keynote by Donna Harman of the US National Institute of Standards and Technology (NIST), where she is a scientist emeritus. Harman has worked in text analytics and natural language processing for many years. In 1992, she started the Text Retrieval Conference (TREC), an ongoing forum that brings together researchers from industry and academia to test their search engines against a common corpus of over a million documents, with associated topics and relevance judgments. In her keynote, titled “Some Thoughts on Failure Analysis for Noisy Data”, she discussed current failure analysis techniques and how they could be extended to retrieval from noisy data.

Following the keynote, John Tait of the Information Retrieval Facility (IRF) in Vienna, Austria, gave an invited talk on the notion of noise and its implications for information retrieval. He urged the community to consider noise an intrinsic property of information, not merely a problem to be eliminated.

There were 27 submissions in total addressing various issues surrounding noisy text, of which 12 were accepted as full papers and four as posters. The papers were organized into three sessions. In addition to the regular paper and poster presentations, AND-II included working-group discussions on several topics relating to noisy text analytics; at the end of the workshop, the working groups presented their views on the selected topics.

The oral sessions were:

· Errors and effects: This session was thought-provoking, as speakers addressed the various types of errors that creep into text when it is first generated or afterwards, raising issues ranging from text typed by dyslexic users to OCR errors and their effects. From this session, the paper “Latent Dirichlet allocation based multi-document summarization” by Rachit Arora and Balaraman Ravindran of the Indian Institute of Technology (IIT) Madras was chosen for the IAPR-sponsored best student paper award. The authors showed how latent Dirichlet allocation and mixture models can capture the various topics discussed in a document and then form a summary from sentences representing those different topics (see the sketch after this list).

· Named entities and blogs: Four interesting papers were presented in this session, addressing topics ranging from rule-based extraction of named entities to the particular challenges of informally written blogs.

· Noisy environments: The final oral session covered issues arising from noisy environments; text generated in such environments is inherently noisy and can require special handling depending on the environment. Speakers talked about issues ranging from SMS (“short message service”) processing to opinion mining from noisy text data. At the end of this session, discussions were held in three working groups, after which the group leaders presented short summaries of each group's thoughts. The discussion topics were (1) data sets, benchmarks, and evaluation techniques for the analysis of noisy text; (2) formal models for noise and the characterization and classification of noise; and (3) linguistic analysis of noisy textual data and its role in information retrieval.
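
To make the award-winning summarization idea above concrete, the following Python fragment (using scikit-learn's LatentDirichletAllocation) is a minimal, purely illustrative sketch; the function name, example sentences, and parameter choices are assumptions for exposition, not the authors' actual system:

# Rough sketch of LDA-based extractive summarization (illustrative only,
# not the system of Arora and Ravindran): fit a topic model over the
# document's sentences, then pick one representative sentence per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_summary(sentences, n_topics=3):
    # Bag-of-words counts for each sentence.
    counts = CountVectorizer(stop_words="english").fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    # Rows of `weights` hold each sentence's inferred topic distribution.
    weights = lda.fit_transform(counts)
    # For each topic, keep the sentence most strongly associated with it,
    # so the summary covers the document's distinct topics.
    picks = sorted({int(weights[:, t].argmax()) for t in range(n_topics)})
    return [sentences[i] for i in picks]

sentences = [
    "Noise in text arises from OCR, typing errors, and informal writing.",
    "Latent topic models group words that tend to co-occur.",
    "A good summary covers each major topic with one sentence.",
]
print(lda_summary(sentences, n_topics=2))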

The four poster papers were presented over tea before the final paper session and addressed diverse topics including blog analysis and Arabic lemmatization. In a “boaster” session in the morning, each poster presenter was given five minutes to boast about their work and encourage the audience to visit their poster during the afternoon session.

 


Organizing Committee

Daniel Lopresti (USA)

Shourya Roy (India)

Klaus U. Schulz (Germany)

L. Venkata Subramaniam (India)

The workshop proceedings are available in electronic format in the ACM Digital Library.

A special issue of the International Journal on Document Analysis and Recognition (IJDAR), consisting of selected papers from the workshop, will appear at a future date; invitations have already been extended to the authors of the selected papers.

Keynote session given by Donna Harman, NIST scientist emeritus