This benchmark evaluates the ability of language models to extract structured information from medieval manuscripts. It focuses on transcribing the text of digitized manuscript images into a structured format.
| Data Type | Images (JPG, 1872x2808, ~1.2 MB each) |
|---|---|
| Amount | 12 images |
| Origin | www.e-codices.ch/de/description/ubb/H-V-0015/HAN |
| Signature | Basel, Universitätsbibliothek, H V 15 |
| Language | Late medieval German |
| Content Description | 15th century manuscript, written by two scribes |
| Time Period | 15th century |
| License | Academic use |
| Tags | manuscript-pages, text-like, handwritten-source, medieval-script, century-15th, language-german |
| Role | Contributors |
|---|---|
| Domain expert | ina_serif |
| Data curator | ina_serif |
| Annotator | ina_serif |
| Analyst | maximilian_hindermann |
| Engineer | maximilian_hindermann, ina_serif |

The ground truth for each page is structured as in the example below:

```json
{
  "[3r]": [
    {
      "folio": "3",
      "text": "Vnd ein pferit die mir vnd\n minen knechten vber hulfend\n den do was nienan kein weg\n denne den wir machtend\n vnd vielend die knecht dick\n vnd vil in untz an den ars\n vnd die pferit vntz an die \n settel vnd was ze mol ein grosser\n nebel dz wir kum gesachend\n vnd also mit grosser arbeit kome\n wir ze mittem tag zuo sant\n kristoffel vff den berg Do\n Do sach ich die buecher Do gar\n vil herren wopen in stond\n die ir stür do hin geben hand\n do stuond mines vatters seligen\n wopen och in dem einen",
      "addition1": ""
    }
  ]
}
```

The benchmark uses two complementary metrics to evaluate the accuracy of the extracted text: fuzzy string matching and character error rate (CER).
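The ground-truth structure above can be parsed with standard JSON tooling. The sketch below uses a shortened inline record for illustration; real entries contain the full page transcription, and the benchmark's actual file layout is not specified here.

```python
import json

# A minimal ground-truth record in the structure shown above
# (text shortened for illustration).
raw = '''
{
  "[3r]": [
    {
      "folio": "3",
      "text": "Vnd ein pferit die mir vnd\\n minen knechten vber hulfend",
      "addition1": ""
    }
  ]
}
'''

ground_truth = json.loads(raw)

# Each page maps a folio label like "[3r]" to a list of entries whose
# "folio", "text", and "addition1", "addition2", ... fields are scored
# individually against the model's response.
for label, entries in ground_truth.items():
    for entry in entries:
        lines = entry["text"].split("\n")
        print(label, "folio", entry["folio"], "-", len(lines), "text lines")
```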
### Scoring a page
Each element ("folio", "text", "addition1", "addition2", "addition3", etc.) is scored by comparing it to the corresponding ground-truth entry, as illustrated below.
### Fuzzy Matching Example

For a ground truth text line:

"Vnd ein pferit die mir vnd"

and a model response:

"und ein pferit die mir vnd"

fuzzy matching yields a high similarity score (approximately 0.99) despite the minor spelling difference ("Vnd" vs. "und").
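The benchmark does not specify which fuzzy-matching implementation it uses; the exact score therefore depends on the algorithm chosen. As one common stand-in, `difflib.SequenceMatcher` from the Python standard library computes a similarity ratio of twice the number of matching characters over the combined length:

```python
from difflib import SequenceMatcher

ground_truth = "Vnd ein pferit die mir vnd"
response = "und ein pferit die mir vnd"

# ratio() = 2 * (matching characters) / (total characters in both strings);
# here 25 of 26 characters match, so the score is just under 1.0.
score = SequenceMatcher(None, ground_truth, response).ratio()
print(f"fuzzy similarity: {score:.2f}")  # → fuzzy similarity: 0.96
```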
### Character Error Rate (CER) Example

For the same pair of strings, the CER is computed as the number of character-level edits (substitutions, deletions, and insertions) needed to turn the model response into the ground truth, divided by the length of the ground-truth text. A single substitution ("u" for "V") over 26 characters results in a very low CER (approximately 0.04), indicating excellent performance.
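The standard CER definition can be sketched with a plain Levenshtein distance, normalized by the ground-truth length (the benchmark's exact normalization is not shown here, so this is an illustrative implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

ground_truth = "Vnd ein pferit die mir vnd"
response = "und ein pferit die mir vnd"

# CER = edit distance normalized by ground-truth length:
# one substitution over 26 characters.
cer = levenshtein(ground_truth, response) / len(ground_truth)
print(f"CER: {cer:.3f}")  # → CER: 0.038
```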
### Scoring the collection

The overall benchmark scores aggregate the per-page fuzzy and CER scores across all pages of the collection.
A perfect result would have a fuzzy score of 1.0 and a CER of 0.0, indicating that all texts were identified and transcribed with perfect fidelity to the original text.
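Aggregation across the collection can be sketched as follows, under the assumption that the overall scores are simple means of the per-page scores (the benchmark's exact weighting is not specified here, and the per-page values below are hypothetical):

```python
# Hypothetical per-page scores for three pages of the collection.
pages = [
    {"page": "3r", "fuzzy": 0.97, "cer": 0.02},
    {"page": "3v", "fuzzy": 0.93, "cer": 0.05},
    {"page": "4r", "fuzzy": 0.95, "cer": 0.03},
]

# Assumed aggregation: unweighted mean over all pages.
overall_fuzzy = sum(p["fuzzy"] for p in pages) / len(pages)
overall_cer = sum(p["cer"] for p in pages) / len(pages)
print(f"overall fuzzy: {overall_fuzzy:.3f}, overall CER: {overall_cer:.3f}")
```

A perfect run would drive the first number to 1.0 and the second to 0.0.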
### Benchmark Sorting
In the benchmark overview page, medieval manuscript results are sorted by fuzzy score in descending order (highest scores at the top), allowing for quick identification of the best-performing models.
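The descending sort by fuzzy score amounts to a one-line ranking step; the result-record fields below are illustrative assumptions, not the overview page's actual data model:

```python
# Hypothetical overall results for three models.
results = [
    {"model": "model-a", "fuzzy": 0.91},
    {"model": "model-b", "fuzzy": 0.97},
    {"model": "model-c", "fuzzy": 0.89},
]

# Sort by fuzzy score, highest first, as on the overview page.
ranking = sorted(results, key=lambda r: r["fuzzy"], reverse=True)
print([r["model"] for r in ranking])  # best-performing model first
```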
Further extensions to this benchmark are planned.