A joint project of the Digitális Örökség Nemzeti Laboratórium (DH-LAB), operating under the consortium leadership of Eötvös Loránd Tudományegyetem (ELTE), and the Erdélyi Digitális Tudománytár (Digitéka) has been successfully completed. As a result, several hundred thousand pages of Hungarian-language Transylvanian press materials have become searchable and sustainably preserved through the use of state-of-the-art digital technologies.

The partners aimed to raise the digital processing of historical Transylvanian press sources to a new level and to make Hungarian-language cultural heritage more accessible to modern research.

In the first phase of the work, optical character recognition (OCR) was carried out on approximately 273,000 page images from 26 historical Transylvanian newspapers. This was followed by the processing of more than 60,000 additional pages provided by the partner. Altogether, 333,492 pages of Hungarian-language Transylvanian press materials were processed. The completed files were delivered to Digitéka as dual-layer, searchable PDFs with a standardized watermark.

The professional significance of the project goes beyond digitization. To improve the efficiency of OCR processes, the partners, also drawing on ELTE's research and development expertise and infrastructure, jointly developed a layout analysis system for recognizing document structure. Within this framework, Digitéka's annotators processed 1,007 pages, which, together with the pages prepared by DH-LAB annotators, resulted in a training dataset of 4,078 annotated pages in total. This dataset lays the foundation for a layout-recognition system specifically optimized for Transylvanian and Hungarian historical documents, significantly improving OCR accuracy.
