Bibliographic Data

Overview

This benchmark evaluates the performance of large language models on extracting structured bibliographic information from historical academic documents. The benchmark consists of 5 pages from the "Bibliography of Works in the Philosophy of History, 1945–1957", each containing multiple bibliographic entries that models must extract and structure according to a predefined JSON schema.

Overview
Dataset Description
Ground Truth
Scoring
Observations

Dataset Description

Data Type	Images (JPG, 1743x2888, ~350 KB each)
Amount	5 images
Origin	http://www.jstor.org/stable/2504495
Signature	n/a
Language	English
Content Description	Bibliography of Works in the Philosophy of History, 1945–1957
Time Period	1945-1957 (works covered), 1961 (publication date)
License	Academic use
Tags	book-pages, list-like, printed-source, century-20th, bibliographic-entries, language-german

Role	Contributors
Domain expert	pema_frick
Data curator	pema_frick
Annotator	sven_burkhardt, pema_frick
Analyst	pema_frick, sorin_marti
Engineer	pema_frick, sorin_marti

The dataset contains 5 pages from a comprehensive scholarly bibliography published as "Chronological List." History and Theory, vol. 1, 1961, pp. 1–74. Each page contains multiple bibliographic entries listing books, articles, and other scholarly works that contribute to the philosophy of history. Entries include standard bibliographic information (author, title, publisher, year) and may contain cross-references to other entries, reviews, and additional notes.

Ground Truth

Ground Truth Creation

The ground truth was manually created by domain experts who extracted and structured the bibliographic information according to the defined schema. Each entry was annotated to capture all relevant bibliographic details, cross-references, and structural relationships between entries.

Ground Truth Format

The ground truth is stored in JSON files with the following structure based on the dataclass schema:

{
  "metadata": {
    "title": "Books",
    "year": "1945",
    "page_number": 2
  },
  "entries": [
    {
      "id": "1",
      "type": "book",
      "title": "Time as Dimension and History",
      "author": [
        {
          "family": "Alexander",
          "given": "Hubert G."
        }
      ],
      "publisher": "University of New Mexico Press",
      "publisher_place": "Albuquerque",
      "issued": 1945
    },
    {
      "id": "6",
      "type": "journal-article",
      "title": "Review of The Use of Personal Documents",
      "author": [
        {
          "family": "Lapiere",
          "given": "R. T."
        }
      ],
      "container_title": "The American Journal of Sociology",
      "volume": "LII",
      "issued": 1946,
      "relation": {
        "reviewed": "5"
      }
    }
  ]
}

Scoring

Evaluation Criteria

The models are tasked with extracting bibliographic entries from academic bibliography pages and outputting a structured JSON document. Models must identify and extract:

Entry identification: Unique identifiers for each bibliographic entry
Entry classification: Type of work (book, journal-article, review, other)
Author information: Family and given names of all authors
Publication details: Title, publisher, place, year, volume, pages as available
Cross-references: Relationships between entries (reviews, reprints, etc.)
Incomplete entries: Detection of entries that continue on subsequent pages

Expected Output Format

Models should output a JSON structure matching the dataclass schema with complete metadata and entry information.

Scoring Methodology

The extracted data is compared to the ground truth using fuzzy string matching with field-level evaluation:

Field Extraction: All terminal fields from both model response and ground truth are extracted
Field Comparison: Each field is compared using fuzzy string matching (RapidFuzz) with a threshold for exact matches
Score Calculation: A score between 0 and 1 is assigned to each field based on similarity
Total Score: The final score is computed as the average accuracy across all fields

Example Scoring

For a bibliographic entry with 8 extractable fields where the model correctly extracts 6 fields with perfect matches and 2 fields with partial matches (0.8 similarity each), the score would be: (6 × 1.0 + 2 × 0.8) / 8 = 0.95

Observations

Common challenges include:

Complex multi-author entries
Abbreviated journal titles and volume notations
Distinguishing between reviews and reviewed works
Handling incomplete entries that span pages

Current Limitations

Dataset Size: Only 5 pages may not capture full range of bibliographic complexity
Time Period: Limited to mid-20th century academic style
Language: English-only content
Domain: Focused specifically on philosophy of history

Future Work

Expand Dataset: Include more pages and different bibliographic styles
Multi-language Support: Add bibliographies in German, French, and other languages
Cross-domain Testing: Test on bibliographies from different academic disciplines
Temporal Coverage: Include bibliographies from different historical periods
Advanced Features: Add support for more complex citation relationships and metadata