ICDAR2017

Special Workshop Speaker

Title

From OCR (Optical Character Recognition) to IDS (Intelligent Decoding Systems): an edit distance of 3, a time span of 100.


Name

Josep Lladós

Abstract

For centuries, paper documents have been one of the major communication channels between humans. Document Image Analysis and Recognition (DIAR) was born with the aim of giving computers the ability to convert scanned documents into editable files through specialized software: OCR, HTR and Graphics Recognition. But documents are no longer static; they have evolved into electronic form, born digital and carrying multimedia content.

The advent of digitally born documents and their role in the digital society does not overshadow the field of DIAR but empowers it, pivoting it in new directions. We outline a dual vision of the future. On the one hand, paper documents will exist for decades. Thus, the concept of “document images”, which digitally preserve paper documents, will persist. Document engineering, of which DIAR is a part, is and will remain a key activity for the productivity of companies. Digital mailroom systems for large-scale processing and information extraction from documents in different formats (images, PDF, e-mail, blogs, posts, etc.) will be at the heart of business intelligence. Also, the subfield called “historical document analysis” is growing in prominence. The documents that are produced nowadays will become historical in a few decades. Will DIAR at some point be replaced by HDIAR? The common basis of these scenarios is large-scale processing, semantic interpretation and heterogeneous content structure.

On the other hand, in our twofold vision of the future, and from a semiotic point of view, the field will move from the signifier (recognition of the constituent signs, textual or graphic) to the signified, i.e. the reading and understanding of the sign system in the context in which it appears. Here the focus on the document as a container of signs is fading away. In its place emerges the concept of an intelligent reading system, one that decodes signs understood as the language of communication between humans. Thus, we are moving the subject from the container to the function.

On the methodological side, DIAR systems have traditionally been strongly pipelined, with pre-established and highly tuned modules. For example, a typical OCR system sequentially performs binarization, line detection, word and character segmentation, and character classification. Although the methods have improved considerably, handling multiple fonts, complex layouts, different languages, etc., the waterfall architecture has been regarded as the standard. The success of deep learning in different computer vision problems has given DIAR powerful new tools, bringing a change of paradigm that does not assume the traditional processing pipelines. Instead, holistic models trained on large amounts of data are proposed. Assuming that this direction persists, it opens another interesting challenge: data annotation. Machine learning methods tend to be highly supervised, requiring high volumes of annotated data so the algorithms can learn to make predictions. This annotation is done manually, for example through crowdsourcing platforms such as Amazon’s Mechanical Turk. The challenge for the future is to find new ways of generating such large amounts of ground truth. Gamesourcing, i.e. incorporating implicit labeling tasks in computer games, is an interesting direction. Another promising idea is to generate realistic-enough images using Generative Adversarial Networks (GANs).
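To make the "waterfall" architecture mentioned above concrete, the following is a minimal sketch of the first stages of such a pipeline (binarization, line detection and character segmentation). It is only an illustration under strong assumptions: a fixed global threshold, a grayscale page given as a list of rows, and glyphs separated by blank rows and columns. The function names (runs, binarize, segment) and the toy page are hypothetical, and word segmentation and character classification are left out.

```python
def runs(flags):
    """Return (start, end) index pairs for each maximal run of True values."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def binarize(gray, threshold=128):
    """Stage 1: fixed global threshold (an assumption); dark pixels count as ink."""
    return [[pixel < threshold for pixel in row] for row in gray]


def segment(gray):
    """Stages 2 and 3: split the page into text lines, then each line into glyph boxes.

    Returns one list per text line, holding (top, bottom, left, right) boxes.
    Each stage consumes only the output of the previous one, which is what
    makes the design a fixed pipeline.
    """
    ink = binarize(gray)
    boxes = []
    # Line detection: rows that contain any ink, grouped into runs.
    for top, bottom in runs([any(row) for row in ink]):
        band = ink[top:bottom]
        # Character segmentation: columns of the line that contain any ink.
        columns = [any(row[x] for row in band) for x in range(len(band[0]))]
        boxes.append([(top, bottom, left, right) for left, right in runs(columns)])
    return boxes


# Toy 7x9 white page with two dark blobs on the first line and one on the second.
page = [[255] * 9 for _ in range(7)]
for y0, y1, x0, x1 in [(1, 3, 1, 3), (1, 3, 5, 7), (4, 6, 2, 4)]:
    for y in range(y0, y1):
        for x in range(x0, x1):
            page[y][x] = 0

print(segment(page))
```

Each box would then be handed to a character classifier as the final stage. An error in an early stage (for instance a poor threshold) propagates to every later one, which is one reason holistic, trained models are proposed instead.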

In conclusion, and going back to the title of this document, we consider that the domain of DIAR is alive but in a reformulation stage, aligning itself with the new demands of the market and the new trends in research. We conceptualize this with a metaphor based on the edit transformation between the acronyms OCR and what we call IDS (Intelligent Decoding Systems). The metaphor states that the two concepts are close (there is a “small” edit distance between the two acronyms), but a century has passed since the first OCR systems were sketched. With this, we illustrate the change of paradigm as follows. First, Optical recognizers become Intelligent ones, i.e. we are going beyond an optical pattern matching scenario and are solving an artificial intelligence problem, narrowing the semantic gap. Second, recognition based on Characters is converted into a Decoding problem, i.e. we are no longer based on letters, but on signs in a human communication activity. Finally, the concept of Recognition takes on the dimension of a System, which means that from now on we should be ambitious in developing end-to-end systems, rather than the manually designed steps of a classical pipeline. Citizens of the future will demand reading services in their daily lives: autonomous cars that read traffic signs, travel assistants that read restaurant menus or translate texts in foreign languages, reading systems embedded in social media platforms to extract information from the uploaded pictures, search and summarization in large-scale document databases and workflows, etc.
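The "edit distance of 3" in the title can be checked with a short sketch. The version below assumes the usual Levenshtein definition (unit-cost insertions, deletions and substitutions); the title does not say which edit distance is meant, and the function name edit_distance is only illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, using one row of the DP table at a time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb, or match
        prev = curr
    return prev[-1]


print(edit_distance("OCR", "IDS"))  # prints 3
```

Since the two acronyms have the same length and share no letter, the cheapest transformation is three substitutions (O to I, C to D, R to S), which correspond to the three conceptual shifts described above.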


Short Bio

Josep Lladós received the degree in Computer Sciences in 1991 from the Universitat Politècnica de Catalunya and the PhD degree in Computer Sciences in 1997 from the Universitat Autònoma de Barcelona (Spain) and the Université Paris 8 (France). He is currently an Associate Professor at the Computer Sciences Department of the Universitat Autònoma de Barcelona and a staff researcher of the Computer Vision Center, where he has also been the director since January 2009. He is an associate researcher of the IDAKS Lab of the Osaka Prefecture University (Japan). He is chair holder of Knowledge Transfer of the UAB Research Park and Santander Bank. He is the coordinator of the Pattern Recognition and Document Analysis Group (2014SGR-1436). His current research fields are document analysis, structural and syntactic pattern recognition and computer vision. He has been the head of a number of Computer Vision R+D projects, has published more than 230 papers in national and international conferences and journals, and has supervised 12 PhD theses. J. Lladós is an active member of the Image Analysis and Pattern Recognition Spanish Association (AERFAI), a member society of the IAPR. He is currently the chairman of the IAPR-EC (Education Committee). Formerly he served as chairman of the IAPR-ILC (Industrial Liaison Committee) and of the IAPR TC-10, the Technical Committee on Graphics Recognition, and he is also a member of the IAPR TC-2 (Structural Pattern Recognition), IAPR TC-11 (Reading Systems) and IAPR TC-15 (Graph based Representations). He is chief editor of the ELCVIA (Electronic Letters on Computer Vision and Image Analysis). He is co-editor of the Series in Machine Perception and Artificial Intelligence (SMPAI) of World Scientific Publishing Company. He serves on the Editorial Boards of the Pattern Recognition journal, IJDAR (International Journal on Document Analysis and Recognition) and the Frontiers in Digital Humanities journal, and is also a PC member of a number of international conferences.
He was the recipient of the IAPR-ICDAR Young Investigator Award in 2007. He was the general chair of the International Conference on Document Analysis and Recognition (ICDAR’2009) held in Barcelona in July 2009, and co-chair of the IAPR TC-10 Graphics Recognition Workshop of 2003 (Barcelona), 2005 (Hong Kong), 2007 (Curitiba) and 2009 (La Rochelle). Josep Lladós also has experience in technology transfer: in 2002 he created the company ICAR Vision Systems, a spin-off of the Computer Vision Center working on Document Image Analysis, after winning the entrepreneur award from the Catalonia Government on business projects on Information Society Technologies in 2000.