This benchmark evaluates how well large language models extract bibliographic information from index cards. It consists of 263 images of cards describing historical dissertations (all from before 1980, some from well before 1900). The set contains both typeset and handwritten cards, and the format and exact content of the descriptions vary.
| Data Type | Images () |
|---|---|
| Amount | 263 images |
| Origin | https://ub.unibas.ch/cmsdata/spezialkataloge/ipac/searchform.php?KatalogID=ak2 |
| Signature | |
| Language | Mostly German; some French, English, Latin, Greek, Finnish, Swedish, Polish |
| Content Description | Dissertationenkatalog bis 1980 |
| Time Period | -1980 |
| License | CC0 |
| Tags | |
The dataset contains 263 images of index cards describing historical dissertations (96.2%) or references to such dissertations (3.8%). Each image corresponds to one card and one dissertation. The images are a random sample of the roughly 700'000 dissertations that Basel University Library collected before 1980. The original works come predominantly from Switzerland and neighboring countries, but some may come from anywhere in the world.
Strictly speaking, a typical card may describe multiple things and events related to a given dissertation/PhD thesis (an abstract work):
Not all of these elements are present in every case, and they are often not explicitly separated on the card. Furthermore, some cards do not contain a full description of a thesis but merely refer to another card in the catalogue. In these cases, the card begins with the name of the referenced author, followed by "s." on a separate line (short for German "siehe", meaning "see"). Other information may or may not follow below that.
Ground Truth Creation
The ground truth was created by manual correction of responses generated by chat-gpt-4o, using the ground-truther.py tool. In addition to correct readings of the text, the following rules were enforced (cf. also the instructions given in prompt.txt):
Ground Truth Format
The ground truth is stored in JSON files with the following structure based on the dataclass schema:
{
"type": {
"type": "Dissertation or thesis OR Reference"
},
"author": {
"last_name": "string",
"first_name": "string"
},
"publication": {
"title": "string",
"year": "integer",
"place": "string (optional)",
"pages": "string (optional)",
"publisher": "string (optional)",
"format": "string (optional)"
},
"library_reference": {
"shelfmark": "string (optional)",
"subjects": "string (optional)"
}
}

The models are tasked with extracting bibliographic information from historical dissertation index cards. Models must output a JSON structure with the fields defined in `dataclass.py`.
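The nested structure of the ground truth could be modelled with Python dataclasses roughly as follows. This is only a sketch and not the contents of `dataclass.py`: the class names (`CardType`, `Author`, `Publication`, `LibraryReference`, `IndexCard`) are hypothetical, and representing the optional fields as `Optional[str]` with a `None` default is an assumption.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CardType:
    type: str  # "Dissertation or thesis" or "Reference"


@dataclass
class Author:
    last_name: str
    first_name: str


@dataclass
class Publication:
    title: str
    year: int
    # Assumption: optional fields default to None when absent from the card.
    place: Optional[str] = None
    pages: Optional[str] = None
    publisher: Optional[str] = None
    format: Optional[str] = None


@dataclass
class LibraryReference:
    shelfmark: Optional[str] = None
    subjects: Optional[str] = None


@dataclass
class IndexCard:
    type: CardType
    author: Author
    publication: Publication
    library_reference: LibraryReference


# Build one card and serialise it to the nested JSON layout.
card = IndexCard(
    type=CardType("Dissertation or thesis"),
    author=Author(last_name="Müller", first_name="Maurice Edmond"),
    publication=Publication(
        title="Die hüftnahen Femurosteotomien unter Berücksichtigung der "
        "Form,Funktion und Beanspruchung des Hüftgelenkes",
        year=1957,
        place="Stuttgart",
    ),
    library_reference=LibraryReference(shelfmark="AT Zürich 7", subjects=""),
)
print(json.dumps(asdict(card), ensure_ascii=False, indent=2))
```

`asdict` recurses into nested dataclasses, so the printed JSON has the same four top-level keys as the ground truth files.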
Key extraction requirements
Expected output format sample (see example image above)
{
"type": {
"type": "Dissertation or thesis"
},
"author": {
"last_name": "Müller",
"first_name": "Maurice Edmond"
},
"publication": {
"title": "Die hüftnahen Femurosteotomien unter Berücksichtigung der Form,Funktion und Beanspruchung des Hüftgelenkes",
"year": 1957,
"place": "Stuttgart",
"pages": "X,184",
"publisher": "Thieme",
"format": "4'"
},
"library_reference": {
"shelfmark": "AT Zürich 7",
"subjects": ""
}
}
Scoring Methodology
The scoring system implements field-level F1 evaluation using the following methodology:
F1 = 2 * precision * recall / (precision + recall)
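A minimal sketch of how a single field comparison and the F1 formula could be implemented. The 0.92 threshold is the one quoted in the example analysis in this document; using `difflib.SequenceMatcher` as the similarity measure is an assumption, since the benchmark's actual fuzzy matcher is not named here. The function names `fields_match` and `f1_score` are hypothetical.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.92  # threshold quoted in the example analysis


def fields_match(predicted, truth, threshold=FUZZY_THRESHOLD):
    """Return True if a predicted field value counts as matching the truth."""
    # Non-string values such as the year are compared exactly.
    if not isinstance(predicted, str) or not isinstance(truth, str):
        return predicted == truth
    # Identical strings (including two empty strings) always match.
    if predicted == truth:
        return True
    # Assumption: difflib's ratio stands in for the benchmark's fuzzy matcher.
    return SequenceMatcher(None, predicted, truth).ratio() >= threshold


def f1_score(tp, fp, fn):
    """Compute F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(fields_match("Stuttgart", "Stuttgart"))   # exact match
print(fields_match("Thieme Verlag", "Thieme"))  # similarity well below 0.92
```

Short fields are sensitive to small differences under such a measure: a single extra space in a five-character string already pushes the similarity below the threshold.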
Benchmark Scoring
The benchmark provides both micro and macro F1 scores:
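The difference between the two aggregates can be illustrated with a short sketch. It assumes the usual definitions: micro F1 pools the TP/FP/FN counts of all cards before computing one score, while macro F1 averages the per-card scores. Whether the benchmark averages per card or per field is not stated here, so the per-card choice is an assumption, as are the function names and the toy counts.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written as 2TP / (2TP + FP + FN), algebraically equal to
    2 * precision * recall / (precision + recall)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(counts):
    """Pool the counts of all cards, then compute a single F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per card, then take the unweighted mean."""
    if not counts:
        return 0.0
    return sum(f1_from_counts(*c) for c in counts) / len(counts)


# Toy data: (tp, fp, fn) for two hypothetical cards.
per_card_counts = [(8, 3, 3), (2, 0, 0)]
print(f"micro F1 = {micro_f1(per_card_counts):.3f}")
print(f"macro F1 = {macro_f1(per_card_counts):.3f}")
```

Under these definitions, cards with many fields weigh more heavily in the micro score, whereas every card contributes equally to the macro score.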
Example Scoring
The following example response is scored against the ground truth shown above for the corresponding image.
{
"type": {
"type": "Dissertation or thesis"
},
"author": {
"last_name": "Müller",
"first_name": "Maurice"
},
"publication": {
"title": "Die hüftnahen Femurosteotomien unter Berücksichtigung der Form, Funktion und Beanspruchung des Hüftgelenkes",
"year": 1957,
"place": "Stuttgart",
"pages": "X, 184",
"publisher": "Thieme Verlag",
"format": "4'"
},
"library_reference": {
"shelfmark": "AT Zürich 7",
"subjects": ""
}
}
Field-by-Field Analysis
- type.type: "Dissertation or thesis" = "Dissertation or thesis" ✓ TP
- author.last_name: "Müller" = "Müller" ✓ TP
- author.first_name: "Maurice" ≠ "Maurice Edmond" (fuzzy match < 0.92) ✗ FP/FN
- publication.title: differs only by a space after "Form," (fuzzy match ≥ 0.92) ✓ TP
- publication.year: 1957 = 1957 ✓ TP
- publication.place: "Stuttgart" = "Stuttgart" ✓ TP
- publication.pages: "X, 184" ≠ "X,184" (fuzzy match < 0.92) ✗ FP/FN
- publication.publisher: "Thieme Verlag" ≠ "Thieme" (fuzzy match < 0.92) ✗ FP/FN
- publication.format: "4'" = "4'" ✓ TP
- library_reference.shelfmark: "AT Zürich 7" = "AT Zürich 7" ✓ TP
- library_reference.subjects: "" = "" ✓ TP

Results
Calculations
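For the example response analysed above, the scores can be worked out from the eleven field outcomes. The sketch assumes that each mismatched field counts as one false positive and one false negative, which is how the "FP/FN" label in the analysis reads; the variable names are illustrative.

```python
# Outcome of each field comparison in the example: True = match (TP).
field_outcomes = {
    "type.type": True,
    "author.last_name": True,
    "author.first_name": False,
    "publication.title": True,
    "publication.year": True,
    "publication.place": True,
    "publication.pages": False,
    "publication.publisher": False,
    "publication.format": True,
    "library_reference.shelfmark": True,
    "library_reference.subjects": True,
}

tp = sum(field_outcomes.values())
mismatches = len(field_outcomes) - tp
# Assumption: a mismatched field is both a wrong prediction (FP)
# and a missed ground-truth value (FN).
fp = fn = mismatches

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# TP=8 FP=3 FN=3
# precision=0.727 recall=0.727 F1=0.727
```

Because every mismatch adds to both FP and FN, precision and recall are equal here, and F1 takes the same value of 8/11.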
Preliminary results indicate that models generally perform well on clearly typed cards with standard formats. However, performance drops on handwritten cards and those with non-standard layouts or abbreviations. Common error patterns include:
Current Limitations
Dataset Limitations
Evaluation Limitations
Future Work
Dataset Enhancements
Evaluation Improvements
Benchmark Extensions