This benchmark evaluates the ability of language models to extract structured information from historical documents printed in Fraktur (Gothic) typeface. Specifically, it focuses on the extraction of metadata and text from classified advertisements in the early modern newspaper "Basler Avisblatt" (1729-1844). The benchmark tests a model's ability to identify the individual advertisements on a page and to transcribe their metadata and text.
| Data Type | Images (JPG, 5000x8267, ~10 MB each) |
|---|---|
| Amount | 5 pages |
| Origin | https://avisblatt.dg-basel.hasdai.org |
| Signature | UBH Ztg 1 |
| Language | Early modern German |
| Content Description | The "Basler Avisblatt" was an early advertisement newspaper published in Basel, Switzerland. |
| Time Period | 1729-1844 |
| License | Academic use |
| Tags | |
The collection comprises 116 yearbooks spanning the newspaper's publication history, with approximately 52 issues per year (increasing to more issues during the 1840s). Each issue varies in length from 1 to 16 pages and contains primarily classified advertisements, alongside occasional news articles, price lists, and announcements. The text is printed in traditional Gothic/Fraktur typeface, presenting challenges for modern OCR technologies and language models.
## Ground Truth Creation
The ground truth for this benchmark consists of manually transcribed and structured data from selected pages of the Basler Avisblatt. Each advertisement is categorized with metadata and its complete text content, preserving the original formatting, spelling, and punctuation.
### Guidelines for creating the ground truth
The following guidelines were used when creating the ground truth:
### Metadata schema
Each advertisement in the ground truth is represented as a JSON object with the following properties:
| Field | Type | Description |
|---|---|---|
| date | string | Publication date in ISO 8601 format (YYYY-MM-DD) |
| tags_section | string | Title of the section containing the advertisement |
| ntokens | integer | Number of words/tokens in the advertisement text |
| text | string | Complete text of the advertisement as it appears in the original |
The entire ground truth file follows this structure:
```json
{
  "[key_id]": [
    {
      "date": "1731-01-02",
      "tags_section": "Es werden zum Verkauff offerirt",
      "ntokens": 21,
      "text": "1. Ein Stücklein von in circa 20. Saum extra schön und guter rother Marggräffer-Wein von Anno 1728. in raisonnablem Preiß."
    }
  ]
}
```

The benchmark uses two complementary metrics to evaluate the accuracy of extracted advertisements: Fuzzy String Matching and Character Error Rate (CER).
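As a quick sanity check, the ground-truth structure described above can be validated with a short Python sketch. The key `"page_001"` and the validation rules are illustrative assumptions, not part of the benchmark tooling:

```python
import json

# Required fields from the metadata schema: date, tags_section, ntokens, text.
REQUIRED = {"date": str, "tags_section": str, "ntokens": int, "text": str}

def validate_ground_truth(data: dict) -> None:
    """Assert that every entry matches the documented schema."""
    for key_id, ads in data.items():
        assert isinstance(ads, list), f"{key_id}: expected a list of ads"
        for ad in ads:
            for field, ftype in REQUIRED.items():
                assert isinstance(ad.get(field), ftype), (
                    f"{key_id}: field '{field}' missing or not {ftype.__name__}"
                )

# "page_001" is a made-up key id for illustration.
sample = json.loads("""
{
  "page_001": [
    {
      "date": "1731-01-02",
      "tags_section": "Es werden zum Verkauff offerirt",
      "ntokens": 21,
      "text": "1. Ein Stücklein von in circa 20. Saum ..."
    }
  ]
}
""")
validate_ground_truth(sample)  # passes silently; raises on malformed entries
```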
## Scoring an ad
Each advertisement is scored by comparing it to the corresponding ground truth entry using the following process:
### Fuzzy Matching Example
For a ground truth advertisement:
"5. Eine zimblich wohl-conditionirte Violino di Gamba, so im Adresse-Contor kan gesehen werden."
And a model response:
"5. Eine zimlich wohl-conditionirte Violino di Gamba, so im Adresse-Contor kan gesehen werden."
The fuzzy matching would yield a high similarity score (approximately 0.99) despite the minor spelling difference ("zimblich" vs. "zimlich").
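This similarity score can be reproduced with Python's standard library. The benchmark's actual fuzzy-matching implementation is not specified here, so `difflib.SequenceMatcher` is an illustrative stand-in:

```python
from difflib import SequenceMatcher

reference = ("5. Eine zimblich wohl-conditionirte Violino di Gamba, "
             "so im Adresse-Contor kan gesehen werden.")
response = ("5. Eine zimlich wohl-conditionirte Violino di Gamba, "
            "so im Adresse-Contor kan gesehen werden.")

# ratio() returns 2*M / (len(a) + len(b)), where M is the number of
# matching characters; a single dropped letter barely lowers the score.
score = SequenceMatcher(None, reference, response).ratio()
print(round(score, 2))  # 0.99
```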
### Character Error Rate (CER) Example
For the same example, the CER is calculated as follows:
1. Compute the Levenshtein distance (edit distance) between reference and response: 1 character deletion ("zimblich" → "zimlich").
2. Divide by the length of the reference text: 1 / 94 ≈ 0.01.
This results in a very low CER score (approximately 0.01), indicating excellent performance.
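The two steps above can be sketched with a plain dynamic-programming Levenshtein distance; the benchmark's actual CER implementation may use a library instead:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

reference = ("5. Eine zimblich wohl-conditionirte Violino di Gamba, "
             "so im Adresse-Contor kan gesehen werden.")
response = ("5. Eine zimlich wohl-conditionirte Violino di Gamba, "
            "so im Adresse-Contor kan gesehen werden.")

# CER = edit distance / length of the reference text.
cer = levenshtein(reference, response) / len(reference)
print(round(cer, 2))  # 0.01
```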
## Scoring the collection
The overall benchmark scores are calculated as follows:
These metrics account for:
- Correctly identified section headings
- Correctly matched advertisement numbers
- Textual similarity to the ground truth
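One plausible way to aggregate per-ad scores into collection-level results is an unweighted mean over all advertisements. The exact aggregation is not specified in this document, so the scheme and the sample values below are assumptions:

```python
# Hypothetical per-ad results as (fuzzy score, CER) pairs.
results = [(0.99, 0.01), (1.0, 0.0), (0.95, 0.04)]

# Simple unweighted means across the collection (assumed aggregation).
mean_fuzzy = sum(f for f, _ in results) / len(results)
mean_cer = sum(c for _, c in results) / len(results)
print(round(mean_fuzzy, 2), round(mean_cer, 2))  # 0.98 0.02
```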
A perfect result would have a fuzzy score of 1.0 and a CER of 0.0, indicating that all advertisements were identified and transcribed with perfect fidelity to the original text.
## Benchmark Sorting
In the benchmark overview page, Fraktur results are sorted by fuzzy score in descending order (highest scores at the top), allowing for quick identification of the best-performing models.
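The descending sort by fuzzy score can be expressed in a couple of lines; the model names and scores here are made up for illustration:

```python
# Hypothetical leaderboard entries: (model name, fuzzy score).
runs = [("model-a", 0.91), ("model-b", 0.97), ("model-c", 0.85)]

# Sort by fuzzy score, highest first, as on the benchmark overview page.
leaderboard = sorted(runs, key=lambda r: r[1], reverse=True)
print([name for name, _ in leaderboard])  # ['model-b', 'model-a', 'model-c']
```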
Planned extensions to this benchmark include: