June 25, 2026
Science City Bahrenfeld
Europe/Berlin timezone

Enhancing OCR using Large Language Models

Not scheduled
1h 30m
AER Atrium (Science City Bahrenfeld)

Albert-Einstein-Ring 8-10 22761 Hamburg
Poster and Lightning Talk
Posterwalk and Lightning Talks

Speaker

Thomas Asselborn (Universität zu Lübeck)

Description

Historical documents remain difficult to digitise accurately, as OCR systems struggle with niche fonts, paper degradation, physical damage, and handwritten annotations. Consequently, OCR output often contains errors that impair the usability of archives. We examine two machine learning-based approaches to OCR post-correction. The first uses the LLM Llama 3 to identify and correct erroneous text and to reconstruct missing text. The second treats erroneous OCR output as a “language” of its own and frames post-correction as a machine translation task: Marian, a pre-trained sequence-to-sequence model, is fine-tuned to translate erroneous OCR text into its corrected form, thereby learning document-specific error patterns. We compare both approaches in terms of accuracy and text reconstruction: LLMs offer flexibility and strong gap-filling capabilities, whereas fine-tuned translation models are faster and more hardware-efficient.
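
One way to make the accuracy comparison concrete is the character error rate (CER): the edit distance between a system's output and a ground-truth transcription, divided by the length of the ground truth. The sketch below is illustrative only. The abstract does not state which metric the authors use, so CER is an assumption here, and the function names and sample strings are hypothetical rather than taken from the authors' corpus or code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb, or keep a match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)


# Hypothetical sample: ground truth, raw OCR output, and a post-corrected
# version that fixes two of the three damaged words.
reference = "the quick brown fox"
raw_ocr = "tbe qu1ck brovvn fox"
corrected = "the quick brovvn fox"

cer_before = character_error_rate(raw_ocr, reference)
cer_after = character_error_rate(corrected, reference)
```

For this sample, the raw OCR line is four edits away from the 19-character reference and the post-corrected line two, so the CER drops from 4/19 to 2/19. Scoring both post-correction approaches against the same reference pages would make their accuracy directly comparable under this assumed metric.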

Authors

Thomas Asselborn (Universität zu Lübeck)
Dr Magnus Bender (Aarhus University)
Prof. Ralf Möller (Universität Hamburg)
Dr Sylvia Melzer (Universität zu Lübeck and Universität Hamburg)

Co-author

Jens Dörpinghaus (University of Koblenz, Federal Institute for Vocational Education and Training (BIBB))