Project IMPACT – Improving Access to Text

IMPACT is a project funded by the European Commission which aims to significantly improve access to historical text and remove the obstacles standing in the way of the massive digitisation of European cultural heritage.

This is a project with participants from different European countries united in a consortium. It comprises representatives of:

  • Universities (the universities of Munich, Innsbruck, Salford and Bath, and Charles University in Prague);
  • Scientific institutes and centres (Institute for Dutch Lexicology, National Centre for Scientific Research (Greece), Jožef Stefan Institute (Slovenia));
  • National and state libraries (the Austrian National Library, the British Library, the German National Library, the National Library of France, the Royal Library of the Netherlands, the National and University Library in Slovenia, the National Library of the Czech Republic);
  • Private companies from Israel and Russia.

The project is funded under the Seventh Framework Programme of the European Commission. The participating countries are Austria, Bulgaria, Great Britain, Germany, Greece, Israel, Russia, Slovenia, France, the Netherlands and the Czech Republic. The National Library of the Netherlands is the coordinator of the project. Most of the countries have two participating institutions: an institution developing the project and a partner providing resources. The Bulgarian participants in the project are the Institute for Parallel Processing at the Bulgarian Academy of Sciences and the National Library “St. St. Cyril and Methodius”.

At the impact level, the objective of the project is to overcome the obstacles to the creation of a European digital library and the difficulties encountered so far in the digitisation process in Europe. As part of its vision for the i2010 European Digital Library initiative, the EU proposed an ambitious plan for mass digitisation projects in order to transform the European printed heritage into accessible digital resources. However, the process has been slowed down by the following factors:

  • Current OCR (Optical Character Recognition) techniques can only be used to a limited extent for reading valuable historical materials. Recognising old fonts, large variations in spelling or complex newspaper layouts yields unsatisfactory results. The same applies to microfilms and to unpublished typewritten texts;
  • Contemporary vocabularies are not sufficient to identify the archaic words, endings and spelling variants found in historical texts.

The project includes the following research highlights:

  • Ensuring the recognition of all printed texts created before 1900;
  • One of the most promising lines of research within the project is enabling OCR software to recognise obsolete letters. The project will expand this research and create a huge lexical resource containing the various conjugated forms and endings of obsolete words that have dropped out of usage, together with their relationship to the modern form of the word;
  • This approach will be tested in 9 European languages from three major language groups (Germanic, Slavic and Romance).
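The lexical resource described above links obsolete word forms to their modern equivalents. A minimal sketch of that idea, assuming a simple two-way lookup table, could look as follows. The sample entries and the function names (`build_lexicon`, `modern_form`, `expand_query`) are hypothetical illustrations and are not taken from the IMPACT tools.

```python
from collections import defaultdict

# Hypothetical sample entries: (historical variant, modern form).
# A real lexicon would be compiled from dated historical sources.
SAMPLE_ENTRIES = [
    ("musick", "music"),
    ("musicke", "music"),
    ("shoppe", "shop"),
    ("olde", "old"),
]


def build_lexicon(entries):
    """Build two lookup tables: variant -> modern and modern -> variants."""
    to_modern = {}
    to_variants = defaultdict(set)
    for variant, modern in entries:
        to_modern[variant.lower()] = modern.lower()
        to_variants[modern.lower()].add(variant.lower())
    return to_modern, to_variants


def modern_form(word, to_modern):
    """Return the modern form of a word, or the word itself if unknown."""
    key = word.lower()
    return to_modern.get(key, key)


def expand_query(modern, to_variants):
    """Return the modern word plus all known historical variants, sorted."""
    key = modern.lower()
    return sorted({key} | to_variants.get(key, set()))


to_modern, to_variants = build_lexicon(SAMPLE_ENTRIES)
print(modern_form("Musick", to_modern))
print(expand_query("music", to_variants))
```

With a table of this kind, a search for a modern word can also retrieve pages on which only the historical spelling occurs.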

The idea of the project is:

  • To develop a multilayered linguistic view of the accessibility of digital text and its increasing use;
  • To develop and introduce language tools and create lexical resources for languages not yet covered by the project, including Bulgarian, Slovenian and Czech, and to allow three more national libraries to provide databases, display project results and build digital competencies in their language areas.

Objectives of the IMPACT project

The main objective of the IMPACT project is to develop innovations in OCR technology and language technology for inventorying and processing historical texts. The two leading industrial partners, IBM (Israel) and ABBYY (Russia), are involved in the development of the text recognition system. IMPACT explores new methods in image enhancement and segmentation, as well as in the use of language technology and historical vocabulary in OCR processing.
Tools have been developed for building a dictionary (thesaurus), for using vocabulary in OCR and in the storage of digital copies, and for structuring documents.
The second objective of the project is to help improve the process of mass digitisation by sharing experience and best practices and by building digitisation competencies across Europe. For this purpose, a website, a help desk, decision support tools, a training programme and a permanent Competence Centre will be set up, where the requirements of digital content owners across Europe and the research interests of partners inside and outside the project can be met.

Phases of the project

In the first phase (2008/2009), the emphasis was on developing tools and content, accumulating data and building an applicable framework with a platform to demonstrate results. Resources were developed for three languages of the Germanic language group: English, German and Dutch.

In the second phase (2010/2011), the IMPACT consortium brought in new partners to use the linguistic tools to create language resources for languages not previously covered by the project, and to act as sites for testing, presenting and gaining experience in their language fields. Partners from different language groups from Southern and Eastern Europe, including Bulgaria, were involved. The objectives of the project have expanded. The following will take place:

  • Presentation of IMPACT tools for the efficient construction of a dictionary (thesaurus) for Slovenian, Bulgarian and Czech. To this end, the Department of Knowledge Technologies at the Jožef Stefan Institute, the Institute for Parallel Processing at the Bulgarian Academy of Sciences and the Institute for Czech Literature will work on improving the OCR software by using a special vocabulary for the historical language. In addition, the Bulgarian Academy of Sciences, together with ABBYY, will introduce Early Cyrillic alphabet characters (obsolete letters) into the OCR software. The National Library of Slovenia, the National Library of Bulgaria and the Czech National Library will collect and deliver databases for the development, evaluation and display of results;
  • Presentation and communication of project results in Slovenia, Bulgaria and the Czech Republic;
  • Developing a permanent Competence Centre. Adding more languages and library partners to the expanded project turns the IMPACT operational model into a Competence Centre in Europe.

Participation of the National Library in the IMPACT project

The National Library is a partner in the second phase of the project and has undertaken to provide digital resources for testing the research on the Bulgarian side, namely:

  • Provide databases for the development, evaluation and demonstration of OCR software, in particular to participate in the project with its database of digitised Bulgarian magazines and newspapers for the period 1882-1944.
  • Present and communicate the project results and support the development of digital competencies in Bulgaria.

What has been done so far within the project

A large number of digital images (up to 5,000) had to be selected for the purposes of the project, to go through the so-called GT (ground truth) process. The GT process consists of creating metadata for each image, containing descriptions of the symbols, the positions of the image segments, etc.
Around 3,700 digital images taken with a camera (continuing editions and two collections) were selected for the first stage and were put through the OCR tests of our project partners from the Bulgarian Academy of Sciences. They showed good results, but there were problems with some characters (the Cyrillic letters п, н, и), which led to the need for a new selection of scanned documents and new tests.
About 3,000 images (continuing editions only), with a resolution above 300 dpi, were selected in the second stage and scanned with the new scanners of the Digital Centre of the National Library. They passed the OCR tests of the Bulgarian Academy of Sciences quite successfully, with very few remarks: stains affecting the readability of some pages, scribbles on some of the pages, drawings, etc.
We are awaiting the start of image processing by a specialised company, which will create PAGE XML metadata that will subsequently be corrected.
The additional work of the National Library concerning the future development of the IMPACT project was related to a further selection of old printed books. Parts of them were scanned and sent to our partners from the Bulgarian Academy of Sciences so that symbols that are out of use can be recognised.
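The ground-truth metadata mentioned above records, for each image, the positions of its segments and the text they contain. As a rough sketch of how such a record might be read, the example below parses a heavily simplified, hypothetical PAGE-style fragment with Python's standard library. The sample content, the file name and the function names are assumptions made for illustration. Real PAGE XML files declare an XML namespace and contain far more detail, so this is not a complete reader for the format.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified PAGE-style fragment (namespace omitted).
SAMPLE = """<PcGts>
  <Page imageFilename="page_0001.tif" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextEquiv><Unicode>Sample heading</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <Coords points="100,400 900,400 900,1200 100,1200"/>
      <TextEquiv><Unicode>Sample body text</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_regions(xml_text):
    """Return the id, outline polygon and transcribed text of each region."""
    root = ET.fromstring(xml_text)
    regions = []
    for region in root.iter("TextRegion"):
        coords = region.find("Coords")
        text = region.findtext("TextEquiv/Unicode", default="")
        regions.append({
            "id": region.get("id"),
            "polygon": parse_points(coords.get("points")),
            "text": text,
        })
    return regions


for region in read_regions(SAMPLE):
    print(region["id"], region["polygon"][0], region["text"])
```

Comparing the transcribed text of each region with the output of the OCR software for the same area is one way such metadata can be used to evaluate recognition quality.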