Noisy unstructured text data is found in informal settings such as online chat, SMS, email, message boards, newsgroups, blogs, wikis and web pages. Text produced by processing spontaneous speech, printed text or handwritten text also contains processing noise. Text produced under such circumstances is typically highly noisy, containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing case information and pause-filling words such as “um” and “uh.” Such text is seen in large amounts in contact centers, online chat rooms, OCRed text documents, SMS corpora and similar sources. Documents with historical text can also be considered noisy with respect to today’s knowledge of the language, yet such text contains useful historical, religious and ancient medical knowledge. The nature of the noisy text produced in all these contexts warrants moving beyond traditional text analytics techniques. The theme of the International Joint Conference on Artificial Intelligence (IJCAI) 2007 was “AI and its benefits to society.” In keeping with this theme, the Workshop on Analytics for Noisy Unstructured Text Data (AND), held in conjunction with IJCAI 2007, proposed to look at text analytics of the highly noisy text produced in such everyday applications in society.

The workshop was chaired by Craig Knoblock, Daniel Lopresti, Shourya Roy and L. Venkata Subramaniam. The call for papers had a very good response. A total of 30 submissions spanning a diverse set of issues relevant to noisy text analytics were received, of which 11 were accepted for oral presentation and 12 for poster presentation. Each submission was reviewed by three members of the program committee. To encourage discussion, the workshop program was structured into topic-oriented oral and poster sessions. The session topics included classification of noisy text, detecting and correcting noisy text, and information extraction from noisy text. The program also had a keynote address and a panel discussion. Each oral session concluded with a two-minute booster by the authors of the poster papers, which let the audience know in advance about the posters that would be presented later in the day. The proceedings of the workshop are available online. Selected papers of AND 07 will be published in a special issue on Noisy Text Analytics of the International Journal on Document Analysis and Recognition.

AND 07 had close to 60 registered participants, making it the largest workshop at IJCAI 2007. The workshop was attended by participants from over 13 countries. As a result of the workshop, an entry on “Noisy Text Analytics” was made in Wikipedia. The keynote address by Gerald DeJong, titled “Robustness through prior knowledge: Using explanation-based learning to distinguish handwritten Chinese characters,” generated a lot of interest and discussion. The panel discussion, led by Daniel Lopresti with Sreeram Balakrishnan, Hwee Tou Ng and Rohini Srihari as panelists, had the provocative theme “Noisy text analytics: An exercise in futility?” The panel identified key problems, proposed some solutions and set the tone for future work in the area. All the author presentations, the keynote address and the panel lectures are available online.

The IAPR best student paper award, decided by an eminent panel led by Raghuram Krishnapuram, went to Monojit Chaudhury, Rahul Saraf and Vijit Jain, the student authors of the paper “Investigation and Modeling of the Structure of Texting Language,” whose non-student authors were Sudeshna Sarkar and Anupam Basu. The paper was selected from among 15 papers in which the primary author was a student.

The social program consisted of a welcome dinner sponsored by IBM Research on the night before the workshop, and the IJCAI inauguration and dinner on the night of the workshop. The inauguration featured a beautiful cultural program that included classical Indian dances and east-west fusion music.

Workshop Report: AND 2007

20th International Joint Conference on Artificial Intelligence (IJCAI 2007)

Workshop on Analytics for Noisy Unstructured Text Data

8 January 2007
Hyderabad, India

Report prepared by L. Venkata Subramaniam

Proceedings of AND 2007 are available at the IJCAI 2007 web site.

Organizing Committee:

Craig Knoblock, University of Southern California, USA

Daniel Lopresti, Lehigh University, USA

Shourya Roy, IBM Research, India Research Lab, India

L. Venkata Subramaniam, IBM Research, India Research Lab, India