
Noisy unstructured text data is ubiquitous in real-world communications. Text produced by processing signals intended for human use, such as printed/handwritten documents, spontaneous speech, and camera-captured scene images, is a prime example. Applying Automatic Speech Recognition (ASR) systems to telephonic conversations between call center agents and customers often yields 30-40% word error rates. Optical Character Recognition (OCR) error rates for hardcopy documents range widely, from 2-3% for clean inputs to 50% or higher, depending on the quality of the page image, the complexity of the layout, aspects of the typography, and so on.

Recognition errors are not the sole source of noise; natural language and its creative usage can also create problems for computational techniques. Electronic text from the Internet (emails, message boards, newsgroups, blogs, micro-blogs, wikis, chat logs, and web pages), contact centers (customer complaints, emails, call transcriptions, message summaries), mobile phones (text messages), etc., is often highly noisy and not ready for straightforward electronic processing. It contains spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing case information, and pause-filling words such as “um” and “uh” in the case of spoken conversations. To raise and address some of these issues, the AND series of workshops was initiated in January 2007. Since then, the AND community has been active in the area of noisy text analytics.

The 4th Workshop on Analytics for Noisy Unstructured Text Data (AND 2010) was organized as part of the Nineteenth International Conference on Information and Knowledge Management (CIKM). The first two editions were one-day workshops held in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI) in 2007 in Hyderabad, India, and the ACM SIGIR Conference in 2008 in Singapore. The third was a one-and-a-half-day workshop held in conjunction with the International Conference on Document Analysis and Recognition (ICDAR) in 2009. Like the first three, the 2010 edition was very successful: over 50 attendees from academic institutions and business organizations in eighteen different countries participated in the workshop.

AND 2010 began with a welcome note from Dan Lopresti. Dan talked about the uniqueness of this forum and how the workshop brings together two different but related communities working on “Document Image Analysis” and “Text Analytics.” This was followed by the keynote from Randy Goebel, Professor at the University of Alberta, Canada. Randy Goebel, who works in the areas of knowledge representation, logic-based non-deductive reasoning, machine learning, visualization, belief revision, systems biology, and computational linguistics, delivered an extremely interesting talk titled “The Nature of Noise in Linguistic Corpora,” in which he discussed in detail what really constitutes noise in a given corpus. He described his recent work on developing computational methods for extracting linguistic structures from relatively large language corpora, including the use of well-known, standard, labeled language resources, such as those from the Linguistic Data Consortium, as well as a spectrum of unlabeled resources, including the Google n-gram repository and a variety of more specific search engine query and answer resources (e.g., from Sogou).

The workshop papers were organized into three sessions. All submissions to the workshop had been reviewed by three members of the program committee, and 11 of a total of 21 submissions were selected for presentation. The papers covered a wide range of topics. Being co-located with CIKM 2010, AND 2010 clearly had the flavor of both Information Retrieval and Knowledge Management. This year there were at least three papers dealing with noisy aspects of social network data, such as Twitter data. Further, in line with earlier years, there were papers on analyzing noisy data from automatic speech recognizers and OCR. The papers covered application domains ranging from automatic scoring of student essays to opinion mining to information extraction.

The final session of the day was the panel discussion led by three leading researchers: Yuji Matsumoto (Information Science, Nara Institute of Science and Technology, Japan), Seamus Ross (Faculty of Information, University of Toronto, Canada), and Gareth Jones (Dublin City University, Ireland). The session began with the moderator, Christoph Ringlstetter, introducing the panelists, setting the tone for the discussion, and then inviting the panelists to give their opening remarks before opening up the discussion to all workshop participants. The panel discussion was aptly titled “Why is it impossible to handle noisy text with existing techniques: The way forward.” The panelists raised and tried to answer some very pertinent questions, such as: What is noise in text documents? Does noise influence research decisions? Should such noise be processed or corrected? There was enthusiastic participation in the panel discussion.

Finally, Venkat Subramaniam gave the closing remarks and announced the IAPR Best Student Paper Award winner. This year’s winner was Julien Fayolle for the paper “Reshaping automatic speech transcripts for robust high-level spoken document analysis” by Julien Fayolle, Fabienne Moreau, Christian Raymond, and Guillaume Gravier.

Overall, AND 2010 was an interesting and valuable workshop attended by some of the leading researchers working in relevant areas. It is expected that selected papers from the workshop will appear in a special issue of the International Journal on Document Analysis and Recognition (IJDAR).

Workshop Report:  AND 2010

Report prepared by the Workshop Co-Chairs

4th Workshop on Analytics for Noisy Unstructured Text Data
held in conjunction with CIKM 2010

26 October 2010
Toronto, Canada



Roberto Basili (Italy)

Daniel Lopresti (USA)

Christoph Ringlstetter (Germany)

Shourya Roy (India)

Klaus U. Schulz (Germany)

L. Venkata Subramaniam (India)

Proceedings of AND 2010 are available on the ACM Digital Library



Keynote: “The Nature of Noise in Linguistic Corpora”

Randy Goebel (Canada)

Panel Discussion

“Why is it Impossible to Handle Noisy Text with Existing Techniques: The Way Forward”

Seamus Ross (Canada), Yuji Matsumoto (Japan), Gareth Jones (Ireland)


The AND 2010 IAPR Best Student Paper Award went to Julien Fayolle (France) for the paper “Reshaping automatic speech transcripts for robust high-level spoken document analysis” by Julien Fayolle, Fabienne Moreau, Christian Raymond, and Guillaume Gravier.