This benchmark evaluates the performance of large language models on extracting structured bibliographic information from historical academic documents. The benchmark consists of 5 pages from the "Bibliography of Works in the Philosophy of History, 1945–1957", each containing multiple bibliographic entries that models must extract and structure according to a predefined JSON schema.
| Data Type | Images (JPG, 1743x2888, ~350 KB each) |
|---|---|
| Amount | 5 images |
| Origin | http://www.jstor.org/stable/2504495 |
| Signature | n/a |
| Language | English |
| Content Description | Bibliography of Works in the Philosophy of History, 1945–1957 |
| Time Period | 1945–1957 (works covered), 1961 (publication date) |
| License | Academic use |
| Tags | book-pages, list-like, printed-source, century-20th, bibliographic-entries, language-german |

| Role | Contributors |
|---|---|
| Domain expert | pema_frick |
| Data curator | pema_frick |
| Annotator | sven_burkhardt, pema_frick |
| Analyst | pema_frick, sorin_marti |
| Engineer | pema_frick, sorin_marti |

The dataset contains 5 pages from a comprehensive scholarly bibliography published as "Chronological List." History and Theory, vol. 1, 1961, pp. 1–74. Each page contains multiple bibliographic entries listing books, articles, and other scholarly works that contribute to the philosophy of history. Entries include standard bibliographic information (author, title, publisher, year) and may contain cross-references to other entries, reviews, and additional notes.
Ground Truth Creation
The ground truth was manually created by domain experts who extracted and structured the bibliographic information according to the defined schema. Each entry was annotated to capture all relevant bibliographic details, cross-references, and structural relationships between entries.
Ground Truth Format
The ground truth is stored in JSON files with the following structure based on the dataclass schema:
{
  "metadata": {
    "title": "Books",
    "year": "1945",
    "page_number": 2
  },
  "entries": [
    {
      "id": "1",
      "type": "book",
      "title": "Time as Dimension and History",
      "author": [
        {
          "family": "Alexander",
          "given": "Hubert G."
        }
      ],
      "publisher": "University of New Mexico Press",
      "publisher_place": "Albuquerque",
      "issued": 1945
    },
    {
      "id": "6",
      "type": "journal-article",
      "title": "Review of The Use of Personal Documents",
      "author": [
        {
          "family": "Lapiere",
          "given": "R. T."
        }
      ],
      "container_title": "The American Journal of Sociology",
      "volume": "LII",
      "issued": 1946,
      "relation": {
        "reviewed": "5"
      }
    }
  ]
}
Evaluation Criteria
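The models are tasked with extracting bibliographic entries from academic bibliography pages and outputting a structured JSON document. Models must identify and extract the page-level metadata, the bibliographic fields of each entry (such as author, title, publisher, and year), and any relations between entries, such as a review of another entry.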
Expected Output Format
Models should output a JSON structure matching the dataclass schema with complete metadata and entry information.
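The benchmark's dataclass definitions are not reproduced here, so the following is only a minimal sketch of what such a schema could look like, inferred from the field names in the ground truth example. The class names (`Author`, `Entry`, `Metadata`, `Document`) and the helper `to_json` are hypothetical, and the choice of which fields are optional is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Author:
    family: str
    given: Optional[str] = None


@dataclass
class Entry:
    # Field names follow the ground truth example; optionality is assumed.
    id: str
    type: str
    title: str
    author: List[Author] = field(default_factory=list)
    publisher: Optional[str] = None
    publisher_place: Optional[str] = None
    container_title: Optional[str] = None
    volume: Optional[str] = None
    issued: Optional[int] = None
    relation: Optional[Dict[str, str]] = None


@dataclass
class Metadata:
    title: str
    year: str
    page_number: int


@dataclass
class Document:
    metadata: Metadata
    entries: List[Entry] = field(default_factory=list)


def to_json(doc: Document) -> str:
    """Serialize a Document, omitting fields whose value is None."""

    def prune(value):
        if isinstance(value, dict):
            return {k: prune(v) for k, v in value.items() if v is not None}
        if isinstance(value, list):
            return [prune(v) for v in value]
        return value

    return json.dumps(prune(asdict(doc)), ensure_ascii=False, indent=2)
```

Dropping `None` values keeps the serialized output close to the example, where a book entry carries no `container_title` and a journal article carries no `publisher`.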
Scoring Methodology
The extracted data is compared to the ground truth field by field using fuzzy string matching, so that each field receives a similarity score between 0 and 1.
Example Scoring
For a bibliographic entry with 8 extractable fields where the model correctly extracts 6 fields with perfect matches and 2 fields with partial matches (0.8 similarity each), the score is the mean of the per-field similarities: (6 × 1.0 + 2 × 0.8) / 8 = 7.6 / 8 = 0.95.
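The averaging in this example can be sketched in a few lines of Python. The benchmark's actual fuzzy matcher is not specified here, so `difflib.SequenceMatcher` from the standard library is used purely as a stand-in. The function names `field_similarity`, `mean_score`, and `score_entry` are hypothetical, and comparing nested values by their string form is a simplifying assumption.

```python
from difflib import SequenceMatcher


def field_similarity(predicted, expected) -> float:
    """Similarity in [0, 1] between two field values, compared as strings.

    SequenceMatcher is a stand-in for the benchmark's fuzzy matcher.
    """
    a = str(predicted).strip().lower()
    b = str(expected).strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def mean_score(similarities) -> float:
    """Mean of per-field similarity scores (0.0 if there are no fields)."""
    similarities = list(similarities)
    return sum(similarities) / len(similarities) if similarities else 0.0


def score_entry(predicted: dict, expected: dict) -> float:
    """Score one entry over the fields present in the ground truth.

    A field missing from the prediction is compared as an empty string.
    """
    return mean_score(
        field_similarity(predicted.get(key, ""), value)
        for key, value in expected.items()
    )


# Worked example from the text: six perfect matches and two partial matches.
example = mean_score([1.0] * 6 + [0.8] * 2)
print(round(example, 2))  # 0.95
```

In this sketch only ground truth fields contribute to the score. Whether extra fields in the model output should be penalized is a separate design decision that the sketch does not address.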
Common challenges include:
Current Limitations
Future Work