
Noisy unstructured text data is ubiquitous in real-world communications.  Text produced by processing signals intended for human use, such as printed/handwritten documents, spontaneous speech, and camera-captured scene images, is a prime example.  Automatic Speech Recognition (ASR) systems applied to telephonic conversations between call center agents and customers often see word error rates of 30-40%.  Optical Character Recognition (OCR) error rates for hardcopy documents range widely, from 2-3% for clean inputs to 50% or higher, depending on the quality of the page image, the complexity of the layout, aspects of the typography, etc.

Recognition errors are not the sole source of noise; natural language and its creative usage can also create problems for computational techniques.  Electronic text from the Internet (emails, message boards, newsgroups, blogs, wikis, chat logs and web pages), contact centers (customer complaints, emails, call transcriptions, message summaries), mobile phones (text messages), etc., is often highly noisy and not ready for straightforward electronic processing.  It contains spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing case information, and pause-filling words such as “um” and “uh” in the case of spoken conversations.  To raise and address some of these issues, the AND series of workshops was initiated in January 2007.  Since then, the AND community has been active in the area of noisy text analytics.

AND 2009 was, for the first time, a one-and-a-half-day workshop.  The first two editions were one-day workshops held in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI) in 2007 in Hyderabad, India, and the ACM SIGIR Conference in 2008 in Singapore.  Like the first two editions, the third was very successful and was attended by over 25 researchers from various international academic institutions and business organizations.

The workshop began with a welcome note from Dan Lopresti.  Dan talked about the uniqueness of this forum and how the workshop brings together two different but related communities working on “Document Image Analysis” and “Text Analytics.”  This was followed by the first keynote of the workshop, given by Hildelies Balk, Programme Manager of IMPACT, a European Union project for the mass digitization of printed European culture.  Hildelies has worked in the field of cultural heritage for over 20 years as a researcher and manager, and she presently serves as the coordinator of the IMPACT project.  She delivered an extremely interesting talk titled “Poor Access To Digitised Historical Texts: The Solutions of the IMPACT Project,” in which she discussed in detail various technical issues in the digitization of historical text, arising from historic fonts, complex layouts, ink bleed-through, and historical spelling variants.  She also gave a comprehensive overview of the different initiatives under which possible solutions are currently being developed by the IMPACT project.

The first paper of the workshop was aptly a survey paper, titled “A Survey of Types of Text Noise and Techniques to Handle Noisy Text,” discussing different types and sources of noise as well as measures traditionally used to quantify it.  This paper was presented by Shourya Roy from the Xerox India Innovation Hub.  “Using Domain Knowledge for Ontology-Guided Entity Extraction from Noisy, Unstructured Text Data,” presented by Sergey Bratus and Anna Rumshisky, addressed the problem of information extraction from noisy text using ontologies and Hidden Markov Models (HMMs).  The third paper was presented by Lipika Dey from TCS Innovation Labs and discussed the effects of noise in text mining applications such as opinion mining from web data sources.  In the last paper of the session, Krishna Subramaniam from BBN Technologies presented his work titled “Robust Named Entity Detection Using an Arabic Offline Handwriting Recognition System.”

Session II started with an interesting single-author paper by Martin Reynaert on handling typographical variation and spelling errors in noisy text collections using a new approach based on anagram hashing.  The AND series of workshops has been successful not only in bringing together researchers working in related areas but also in introducing new types of data for further research.  The next paper was about a new domain and a new type of data: “criminal investigation data.”  Cristina Giannone presented work on information extraction from such data using kernel-based techniques.  This paper won the IAPR Best Student Paper Award.  Following this was a presentation by Daniel Lopresti on an intriguing idea, “Tools for Monitoring, Visualizing, and Refining Collections of Noisy Documents.”  This was work in progress in which he described research on developing tools to help users view and understand the results of common document analysis procedures and the errors that might arise.

The final session of the day was focused on historical text.  It began with a talk by Annette Gotscharek from the University of Munich on “Enabling Information Retrieval on Historical Document Collections – the Role of Matching Procedures and Special Lexica.”  This work was part of the IMPACT project and hence was strongly connected to one of the initiatives Hildelies had mentioned in the morning.  It was followed by another paper on accessing information from historical text, presented by Simone Marinai.  Simone talked about an approach to indexing and retrieving text from early printed documents; the technique was tested on the well-known Gutenberg Bible.  The last two papers, titled “A Comprehensive Evaluation Methodology for Noisy Historical Document Recognition Techniques” (presented by Nikolaos Stamatopoulos) and “Accessing the Content of Greek Historical Documents” (presented by Anastasios Kesidis), were interesting and thought-provoking for researchers who have been working in the area of historical text.

Like the second edition of the workshop, AND 2009 included working group sessions.  Just before the commencement of the final session, Dan Lopresti explained the purpose of the working groups, the logistics, and some potential topics.  At the end of the day, the participants met and brainstormed to decide on the topics and the formation of the actual groups.  The topics identified were “Noisy Databases” and “Linguistic Analysis of Noisy Text Data.”  The “Noisy Databases” working group discussed different types of databases, namely image-oriented and text-oriented, for the analysis of noisy data.  The other working group looked at the issue and importance of linguistic analysis of noisy text data.  The discussion outcomes are presently being written up by volunteers who participated in the working groups and are targeted for inclusion in a special issue of the International Journal of Document Analysis and Recognition (IJDAR).

The second day of the workshop started with an interesting and relevant keynote talk on “Handwritten Document Retrieval Strategies” by Venu Govindaraju, Distinguished Professor of Computer Science and Engineering at the State University of New York, Buffalo.  This was extremely informative both for researchers who have worked in the field and for people who are beginning to consider such problem areas.  Venu talked in detail about the importance of retrieval from handwritten documents, the challenges involved, and various techniques for addressing them.  He also shared his invaluable experiences from real-life engagements with organizations such as the U.S. Postal Service.

The final session of the workshop had four interesting papers.  The first one, “Edge Noise in Document Images,” presented by Elisa Barney Smith, proposed a new measure, “Noise Spread,” for quantifying edge noise in image degradations produced by desktop scanning.  Nazih Ouwayed discussed the problem of estimating the skew angle of noisy handwritten Arabic documents using the energy distributions of Cohen’s class.  The paper titled “Digital Weight Watching: Reconstruction of Scanned Documents,” presented by Tim Gielissen, was about an efficient storage mechanism for scanned images of documents.  The last paper of the workshop was a short paper by Kolyo Onkov titled “Effect of OCR-Errors on the Transformation of Semi-Structured Text Data Into Relational Database.”

Overall, AND 2009 was an interesting and valuable workshop attended by some of the leading researchers working in relevant areas.

Conference Report: AND 2009

Report prepared by Shourya Roy (India)

3rd Workshop on Analytics for Noisy Unstructured Text Data
Held in conjunction with ICDAR 2009 (see report in the IAPR Newsletter, October 2009)
23-24 July 2009
Barcelona, Spain

Organizing Committee:

 

Daniel Lopresti (USA)

Shourya Roy (India)

Klaus U. Schulz (Germany)

L. Venkata Subramaniam (India)

Proceedings of the workshop are available online at:

portal.acm.org/toc.cfm?id=1568296

Selected papers from the workshop will appear in a special issue of the International Journal of Document Analysis and Recognition (IJDAR).