ICDAR2017

Special Workshop Speaker

Title

Towards Deeper Understanding of Wider Category of Documents


Name

C. V. Jawahar

Abstract

Humans will continue to create and consume documents. This will necessitate methods, solutions and systems for understanding documents as well as interacting with them. With the emerging changes in the form, content and organization of documents, there will be a natural drift in the problem space as well as the solution space for document understanding. The recent emergence of machine learning as a powerful tool for solving a class of perception problems suggests that semantically richer tasks and visually complex data are now within reach. Here we discuss three specific example directions.

In the near future, semantic understanding of large document repositories could emerge as a basic necessity. Large document collections (ranging from digital libraries of historical documents to research papers) are now common. Understanding goes well beyond simple recognition. It is not just the creation of text or the reproduction of documents. Meanings and relationships between entities (e.g. words, sentences, graphics) need to be extracted from the documents and their associated contexts. Mechanisms to extract useful information (using complex queries, interactive dialogues and prediction of user intentions) will become an absolute necessity. This is not a mere adaptation and extension of the popular solutions in NLP and IR. There are many special challenges ahead of us. These include incorporating the noise and uncertainty introduced in the recognition phase into the document understanding process, decoding and interpreting the visual arrangement of entities in physical and electronic documents, and understanding human interaction patterns and human intent.

Wearable and mobile vision systems now also need to capture, process and understand documents. They introduce a new class of challenges for researchers and practitioners. The challenges at the capture stage itself (mobile apps and preprocessing, rectification and restoration etc.) are not yet fully understood. Beyond the traditional problems that get mapped into this pipeline, many new problems will emerge in this space. Examples include understanding textual content in the wild (beyond simple scene text words), techniques that assist casual capture of documents (as opposed to focused capture) in first person vision systems, imaging related issues for text, and perceptual and cognitive aspects of human document interaction.

Even the existing advances in document analysis and recognition need many further improvements and extensions. The results in these areas need additional work to reach newer languages or newer document categories. This is not a mere replacement of the existing training data or a simple domain adaptation task. Problem formulation, as well as the capture of domain information (e.g. scripts, symbols, grammar, linguistics), is itself in its infancy in many areas. Even where basic recognition is well understood (e.g. numeral or character recognition), the use of higher level information to recognize, for example, a financial summary sheet is still largely ad hoc. How does high level prior knowledge interact with the low level symbol recognizers? This needs further investigation. Mathematical models that capture the domain knowledge are required for designing systems that can be analyzed better.

In short, there are a number of problems and directions yet to be investigated in detail. Many new problems are emerging due to advances in technology and algorithms and to changing human needs. We, as a community, need to adapt and take leadership in formulating and solving these problems.


Short Bio

C. V. Jawahar is a professor at IIIT Hyderabad, India. He received his PhD from IIT Kharagpur and has been with IIIT Hyderabad since Dec. 2000. At IIIT Hyderabad, Jawahar leads a group focusing on computer vision, machine learning and multimedia systems. In recent years, he has been looking into a set of problems at the intersection of vision, language and text. He is also interested in large scale multimedia systems, with a special focus on retrieval. He has more than 50 publications in top tier conferences in computer vision, robotics and document image processing. He is a fellow of IAPR. He has served as a chair for previous editions of ACCV, WACV, IJCAI, ICCV and ICVGIP. Presently, he is an area editor of CVIU and an associate editor of IEEE PAMI. He is also a program co-chair for ICDAR 2017 and ACCV 2018.