Project Details
DPO-HP - Digital Preservation of OCR-D data for historical printings
Term
from 2018 to 2020
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 394410994
High-quality, comprehensive research in the historical sciences requires unrestricted access to historical sources. Numerous images of historical prints from the 16th to the 19th century are now available thanks to several cataloging and digitization projects. Both serial cataloging and the mass digitization of titles have improved, especially in the context of the “Verzeichnisse Deutscher Drucke”. The processed works have not only been cataloged according to national bibliographic standards, but have also been digitized to a large extent. The bibliographic metadata standard of these images already meets scientific requirements. For further research, it is crucial to be able to search and use the full texts of digitized works as well. Optical Character Recognition (OCR) techniques allow the mass creation of full texts. However, the methods used so far were not suitable for immediate use in libraries, archives and other institutions, since the resulting texts show too many orthographic differences. There has been intensive work on easily transferable applications that allow high-quality mass processing of all historical prints from the 16th to the 19th century. This will rapidly increase the number of OCR texts.
For further use, sustainable preservation and identification of the images, the bibliographic metadata, and the encoded full texts and their versions are obligatory. A standardized concept must be created to ensure this. In addition, the availability and citability of the OCR texts are an important prerequisite for the verifiability of scientific results. Hence, OCR texts must be added to the existing archive of a digital object, alongside its structure, metadata and images.
Different versions of the same source material arise through intellectual effort, through improvements to the OCR process in particular, or through the use of different OCR techniques. This poses a new challenge for persistent identification and long-term preservation. The problem touches on aspects of research data management and also requires consideration of methods and strategies for handling research data.
These requirements must be conceptually prepared, integrated into an extended context, and implemented as a technical solution in order to meet the needs of both the data holders and the users. Based on this initial situation, the project defines the steps necessary to realize a solution for long-term preservation and persistent identification of OCR texts.
DFG Programme
Research data and software (Scientific Library Services and Information Systems)