Speaker
Description
Historical documents remain difficult to digitise accurately, as OCR systems struggle with niche fonts, paper degradation, physical damage, and handwritten annotations. As a result, OCR output often contains errors that impair the usability of archives. We examine two machine learning-based approaches to OCR post-correction. The first uses the LLM Llama 3 to identify and correct erroneous text and to reconstruct missing text. The second treats OCR output as a “language” and frames post-correction as a machine translation task: Marian, a pre-trained sequence-to-sequence model, is fine-tuned to translate erroneous OCR text into its corrected form, learning document-specific error patterns in the process. We compare the two approaches in terms of accuracy and text reconstruction. LLMs offer flexibility and strong gap-filling capabilities, while fine-tuned translation models are faster and more hardware-efficient.
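To make the accuracy comparison concrete, here is a minimal sketch of how corrected output could be scored against a ground-truth transcription. The abstract does not say which metric is used; character error rate (CER), a common choice for OCR evaluation, is assumed here. The function names and the sample strings are hypothetical and not taken from the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop ch_a
                current[j - 1] + 1,      # drop ch_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


# Hypothetical sample: raw OCR output versus a post-corrected version.
reference = "the quick brown fox"
raw_ocr = "tbe qu1ck brovvn fox"
corrected = "the quick brown fox"

print(f"CER before correction: {cer(raw_ocr, reference):.3f}")
print(f"CER after correction:  {cer(corrected, reference):.3f}")
```

Scoring the raw OCR text and the output of each post-correction model against the same reference gives one number per system, which makes the two approaches directly comparable. CER does not by itself capture how well missing passages are reconstructed, so gap-filling quality would need a separate measure.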