S129: Machine learning in Czech libraries - OCR for early printed and handwritten documents

Bibliothekskongress 2022

Session: 95, 96 or 100 Percent. OCR-D (S129)

P. Žabička¹
¹Moravian Library, Brno, Czech Republic

Abstract Text: The Moravian Library in Brno (MZK) has an extensive digital library containing a wide range of digitized documents. Most of these documents include an OCR version used mainly for search and discovery. For some document categories, standard OCR did not produce adequate results, most notably for manuscripts, old prints in variants of Gothic script, and newspapers scanned from microfilm. To solve this problem, MZK joined forces with the Brno University of Technology to develop a machine-learning-based OCR system for these categories of documents. Currently, the results of the PERO project can be tested online through either a web-based interface or an API. The software is open source and is already in use by several institutions. The paper will present the current capabilities of the system as well as some of the secondary results of the project.

Speakers: Petr Žabička