Feature Article

Pattern Recognition in Digital Libraries

By Larry O’Gorman


Researching the Greenstone Digital Library project brought me back to my days as a student when, on a slow afternoon, I might just wander the stacks of the university library and let serendipity happen. I did the same with Greenstone: I went to a web page listing examples of libraries built with Greenstone software, chose a few, and perused the contents. The choices were variously fascinating, entertaining, mesmerizing, and horrifying. I looked at pictures of bridges built during the 19th century in the Lehigh University Digital Bridges Collection, and of ships that plied the Great Lakes in the Great Lakes Shipping Database. I read about the history, culture, and land claims of the British Columbia aboriginal peoples in a library called Our Homes are Bleeding / Nos Foyers Saignent. I listened to readings from poets and writers who had attended the Iowa Writers’ Workshop in The Writing University Archive. I learned about the mechanics of refrigeration in coursework material contained in Revised Curricula for Nigerian Polytechs. I found material on instances and prevention of landslides in the WHO Health Library for Disasters. And I sampled news articles, legislation, and pictures covering kidnapped children during the 1970s and 1980s in an archive called Human Rights in Argentina.

In the late 1980s, several technologies matured to the point of enabling the development of digital libraries. The three precipitating technologies were probably faster computers, larger computer memories, and inexpensive scanning hardware (a by-product of the widespread use of fax machines), which together made it efficient and economical to handle high-resolution images of pages containing text. These advances paved the way for the document analysis techniques of image capture and processing, binarization, and optical character recognition (OCR). With these technologies, a digital library could be built of books, journals, and other material that could be read from scanned images and searched via the recognized text.
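To make the binarization step concrete, here is a minimal sketch in Python. It assumes an 8-bit grayscale page stored as a list of rows and uses Otsu's method to pick a single global threshold. The article does not name a particular thresholding method, so the choice of Otsu's method is an assumption made for illustration, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels <= t form one class (assumed to be ink), pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of gray levels at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image: List[List[int]], threshold: int) -> List[List[int]]:
    """Map each pixel to 1 (ink) if it is at or below the threshold, else 0."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Tiny made-up "page": dark ink (20, 30) on a light background (200, 220).
page = [[20, 200], [220, 30]]
t = otsu_threshold([p for row in page for p in row])
bitmap = binarize(page, t)
```

A single global threshold works for evenly lit scans; pages with shading or bleed-through usually call for locally adaptive thresholds, which this sketch does not attempt. The resulting bitmap is the kind of input an OCR stage would then consume.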

In the early 1990s, Ian Witten, a professor at the University of Waikato in New Zealand, was working on text and index compression, and applying this work to the “gray literature” of research papers in computer science. At that time, Michel Loots, a Belgian doctor practicing in Africa, was formulating the idea that access to appropriate information was the overarching problem faced by developing countries, and he contacted Ian for help. Under a newly formed Human Info NGO, Loots compiled several collections of humanitarian information from various international organizations, using Ian’s software to make them available on CD-ROM as fully searchable digital library collections. This was all that was needed to give Ian a focus that continues to this day.

The fact that his software supported many humanitarian CD-ROMs that were widely distributed in developing countries was influential and satisfying; however, Ian had learned something during this work. In the words of the Chinese proverb, “Give a man a fish and you feed him for a day. Teach a man to fish and you feed him for a lifetime.” If he could offer the tools to build a digital library, rather than the library itself, the recipients themselves could populate the contents in a more effective and sustainable fashion than if a finished library were simply handed to them. This began what would become the Greenstone Digital Library project, a software package enabling users to develop and distribute digital library collections via CD-ROMs or the Web. A partnership with UNESCO (United Nations Educational, Scientific and Cultural Organization) provided input from potential end-users and facilitated wide distribution of the software. The project was built upon some fundamental principles. The interface must be simple to use for people with little or no technical training. The software must run on multiple operating systems, and it could not depend on the latest version of any of them to run efficiently, because many computers in developing countries had dated hardware and software. The software would have to be designed to support interfaces in many different languages. The package must be open source software.

The most extensive set of libraries built with Greenstone is under the New Zealand Digital Library, which contains about 40 libraries of different humanitarian and UN collections. Included here are a food and nutrition library, a humanity development library, agricultural information modules, and the WHO (World Health Organization) medicines bookshelf. A particularly unusual library is built for the illiterate, who make up 20% of the world’s population and 40% of those in sub-Saharan Africa, the Middle East, and South Asia. This library, called First Aid in Pictures, contains simple illustrations of injuries and their proper first aid responses. The New Zealand Digital Library has been endorsed by the Communication Sub-Commission of the New Zealand National Commission for UNESCO as part of New Zealand's contribution to UNESCO. In 2004, this work was made the 7th recipient of the IFIP (International Federation for Information Processing) Namur award. This is a biennial award for an outstanding contribution with international impact to the awareness of social implications of information technology. Professor Witten’s award lecture was entitled “Democratizing information: Digital libraries, developing countries, and information for all”.

Since November 2000, Greenstone has been downloaded an average of 4,500 times per month. It is available to end-users in about 40 languages and to librarians in English, French, Spanish, and Russian. Training courses are given internationally. One country that has embraced this technology in a major way is India. As Ian explained, “Sometimes the adoption of new technology in developing countries leapfrogs currently adopted technologies in developed countries.” In India’s case, traditional libraries are rare, especially outside of the major cities; however, there are now substantial efforts to make volumes available digitally. As evidence of this strong interest, the President of India, Dr. A. P. J. Abdul Kalam, inaugurated the International Conference on Digital Libraries, held in New Delhi in 2004. In a speech demonstrating his support and keen awareness of digital libraries, he spoke of the advantages these can bring to countries of the world, especially those less privileged.

Ian lists several challenges for future digital library work. The number one challenge is interoperability. The US Library of Congress and others participate in setting standards for digital libraries, and the pace of introduction of new standards has accelerated, so keeping up with this is a task in itself. Preservation of materials is another challenge. In Ian’s words, “No one yet understands the world of preservation.” We will face this challenge even with our personal “digital libraries” of digital photographs. Another challenge is education. Although Ian himself travels the world giving courses to augment the Greenstone training and material offered by UNESCO, there is still the need for more education to foster higher adoption rates. One challenge Ian addresses directly to the pattern recognition community is the need for open source OCR software and methods for page layout analysis and title/author extraction. Although commercial packages are available, these cannot be included in the open source Greenstone.

While this article has focused on the benefits of digital libraries for developing countries, Greenstone software can be used for any type of library – as evidenced by the North American bridge and ship libraries mentioned in the introductory paragraph.

 

If you have content that should be available to others, Greenstone can help you publish it. If you are a researcher, look over the challenges described above to see if you might focus your efforts on any of these. And if you just wish to virtually reenact wanderings of university stacks from your student days, go to www.greenstone.org, choose a library, and let serendipity happen.

Feature Articles on uses of Pattern Recognition

Pattern Recognition at the US Postal Service: A Decade of Achievement, Apr. ’06

Pattern Recognition in Two National Labs, Jan. ’06

Pattern Recognition in Traffic Engineering, Jul. ’05

Pattern Recognition in Astronomy and Photonics, Apr. ’05

Pattern Recognition in Origami, Jan. ’05

Pattern Recognition in Defense Applications, Jan. ’04

Pattern Recognition in Maps, Sep. ’03

Pattern Recognition in Security and Entertainment, Jun. ’03

Pattern Recognition in Sports, Apr. ’03

References

Greenstone Digital Library web page: www.greenstone.org

Ian Witten’s personal page: http://www.cs.waikato.ac.nz/~ihw/

New Zealand Digital Library: http://www.sadl.uleth.ca/nz/cgi-bin/library

Ian H. Witten and David Bainbridge, How to Build a Digital Library, Morgan Kaufmann, 1st edition (July 15, 2002)

Namur Award lecture: http://www.info.fundp.ac.be/%7Ejbl/IFIP/NA2004_Lecture.htm
