RISE Humanities Data Benchmark, 0.4.0

Business Letters

Overview

This benchmark focuses on extracting metadata from historical letters. It provides ground truth for the metadata categories `send_date`, `sender_persons` and `receiver_persons` for a collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft written between 1926 and 1932, and reports F1-micro and F1-macro scores across these categories.

Dataset Description

Data Type: Images (JPG, 2479x3508, ~600 KB each)
Amount: 57 letters, 98 images
Origin: http://dx.doi.org/10.7891/e-manuscripta-54917
Signature: CH SWA HS 191 V 10
Language: German
Content Description: Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft
Time Period: 1926 - 1932
License: Academic use
Tags:

This benchmark uses as input the digital collection "Basler Rheinschifffahrt-Aktiengesellschaft, insbesondere über die Veräusserung des Dieselmotorbootes 'Rheinfelden' und die Gewährung eines Darlehens zur Finanzierung der Erstellung des Dieselmotorbootes 'Rhyblitz' an diese Firma" (shelf mark: `CH SWA HS 191 V 10`, persistent link: http://dx.doi.org/10.7891/e-manuscripta-54917, referred to as "the collection" in what follows) of the Schweizer Wirtschaftsarchiv.

The collection consists of 68 letters of varying length (mostly 1-3 pages). The letters are dated between 1926 and 1932 and are written in German. They are mostly typewritten, with some handwritten annotations. They reflect the correspondence of the Basler Rheinschifffahrt-Aktiengesellschaft and are mostly signed by individuals or companies. They cover a variety of topics, including business transactions, shipping schedules, and personnel matters. For this benchmark, ground truth has been created for a subset of 57 letters (see below).

Ground Truth

The ground truth for the collection has been created in a Google Sheet and then imported and used to benchmark LLMs with respect to information extraction tasks. The following fields have been extracted:

 

| Field Name | Description | Data Type |
| --- | --- | --- |
| `transkribus_doc_url` | URL link to the letter on Transkribus. | string (URL) |
| `document_number` | The letter's number is between 1 and 68, inclusive (1 ≤ i ≤ 68). | zero padded integer |
| `done` | Indicates whether the creation of the ground truth is completed. | boolean |
| `checked_by` | Identifier of the person who is responsible for creating the ground truth. | string |
| `send_date` | Date when the letter was sent. | ISO 8601 date or None |
| `letter_title` | Title of the letter as diplomatically inscribed. | string or None |
| `sender_persons_inscribed` | Sender person(s) as diplomatically inscribed. | string or None |
| `sender_persons` | Individuals explicitly mentioned as senders in the document. | string or None |
| `receiver_persons_inscribed` | Receiver person(s) as diplomatically inscribed. | string or None |
| `receiver_persons` | Individuals associated with receiving the document, inferred or explicitly stated. | string |
| `has_signatures` | Indicates whether the document contains signatures. | boolean |
| `signatures_recognised` | Indicates whether all signatures have been mapped to persons as per ground_truth/persons.json. | boolean |
| `comment` | Additional comments or annotations about the document. | string or None |
| `action_required` | Indicates what action is still required before the document can be marked as done. | string |

Persons

Persons are recorded in the persons tab of the Google Sheet. The metadata schema and workflow for persons is described in ground_truth_persons_organizations.md.

For `sender_persons_inscribed`, `sender_persons`, `receiver_persons_inscribed` and `receiver_persons`:

  • Pipe | is used to separate multiple values: Mustermann, Hans | Musterfrau, Maria.
  • Indicate persons inferred from function & date with angle brackets: <Mustermann, Hans>
  • Indicate persons inferred from the correspondence history with double angle brackets: <<Musterfrau, Maria>>

The sender_persons and receiver_persons fields use the names of persons as recorded in the normalized persons tab. Be sure to add inscribed variants to their respective alternateName fields.
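The notation above (pipe-separated values, single angle brackets for persons inferred from function and date, double angle brackets for persons inferred from the correspondence history) could be parsed roughly as follows. This is only a sketch: `PersonRef`, `parse_persons` and the labels `"function"` and `"correspondence"` are illustrative names, not part of the benchmark's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PersonRef:
    """One entry of a persons cell (hypothetical helper type)."""
    name: str
    inferred_from: Optional[str]  # None, "function" or "correspondence"


def parse_persons(cell: Optional[str]) -> List[PersonRef]:
    """Split a pipe-separated cell and classify each entry by its brackets."""
    if cell is None or not cell.strip():
        return []
    refs = []
    for raw in cell.split("|"):
        entry = raw.strip()
        if not entry:
            continue
        # Check the double brackets first, since "<<x>>" also starts with "<".
        if entry.startswith("<<") and entry.endswith(">>"):
            refs.append(PersonRef(entry[2:-2].strip(), "correspondence"))
        elif entry.startswith("<") and entry.endswith(">"):
            refs.append(PersonRef(entry[1:-1].strip(), "function"))
        else:
            refs.append(PersonRef(entry, None))
    return refs


for ref in parse_persons("Mustermann, Hans | <<Musterfrau, Maria>>"):
    print(ref.name, ref.inferred_from)
```

Keeping the inference marker separate from the name lets a scorer decide later whether inferred persons should count.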

Importing the ground truth from Google Sheets

Letters that are done are exported from the ground_truth_export tab of the Google Sheet as a CSV file and saved to ground_truth/letters.csv.
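Reading the exported CSV back in might look like the sketch below. It assumes that the CSV header uses the field names from the table above and that empty cells stand for `None`; `read_letters` is a hypothetical helper, and the file encoding in the usage comment is likewise an assumption.

```python
import csv
from typing import Dict, Iterable, List, Optional


def read_letters(lines: Iterable[str]) -> List[Dict[str, Optional[str]]]:
    """Parse CSV lines into one dict per letter, mapping empty cells to None."""
    letters = []
    for row in csv.DictReader(lines):
        letters.append({
            key: (value.strip() or None) if isinstance(value, str) else None
            for key, value in row.items()
            if key is not None  # ignore surplus cells without a header
        })
    return letters


# Usage (path from the text above; UTF-8 is an assumption):
# with open("ground_truth/letters.csv", newline="", encoding="utf-8") as handle:
#     letters = read_letters(handle)
```

Values are kept as strings so that the zero-padded `document_number` survives unchanged.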

Scoring

Consider the first letter as an example. The letter is composed of three pages.

| Metric | Ground Truth | Prediction | TP | FP | FN |
| --- | --- | --- | --- | --- | --- |
| send_date | 1926-02-16 | 1926-02-16 | 1 | 0 | 0 |
| sender_persons | Groschupf-Jaeger, Louis | Basler Rheinschiffahrt-Aktiengesellschaft | 0 | 1 | 2 |
| | Ritter-Dreier, Fritz | | | | |
| receiver_persons | Christ-Wackernagel, Paul | Herr Christ | 1 | 1 | 0 |
| | | i/Fa. Paravicini, Christ & Co. | | | |

  • send_date: The prediction matches the ground truth (1 TP).
  • sender_persons: The prediction is incorrect (1 FP), as "Basler Rheinschiffahrt-Aktiengesellschaft" is not a sender person, and the two actual sender persons ("Groschupf-Jaeger, Louis" and "Ritter-Dreier, Fritz") are missing (2 FN).
  • receiver_persons: The prediction is partly correct, as "Herr Christ" refers to the receiver person "Christ-Wackernagel, Paul" (1 TP), and partly incorrect, as "i/Fa. Paravicini, Christ & Co." is not a receiver person (1 FP).
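The counting for the example letter can be reproduced with plain set operations. This is a sketch under one assumption: predicted names are first mapped to normalized names through an alias table. The `ALIASES` dictionary and both function names are hypothetical, standing in for whatever lookup of inscribed variants the benchmark actually performs.

```python
from typing import Dict, Iterable, Set, Tuple


def normalize(names: Iterable[str], aliases: Dict[str, str]) -> Set[str]:
    """Map inscribed variants to normalized names; unknown names pass through."""
    return {aliases.get(name.strip(), name.strip()) for name in names}


def count_matches(truth: Set[str], predicted: Set[str]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for one metadata category of one letter."""
    tp = len(truth & predicted)   # predicted and in the ground truth
    fp = len(predicted - truth)   # predicted but not in the ground truth
    fn = len(truth - predicted)   # in the ground truth but not predicted
    return tp, fp, fn


# Hypothetical alias table for the example letter.
ALIASES = {"Herr Christ": "Christ-Wackernagel, Paul"}

sender_counts = count_matches(
    {"Groschupf-Jaeger, Louis", "Ritter-Dreier, Fritz"},
    normalize(["Basler Rheinschiffahrt-Aktiengesellschaft"], ALIASES),
)
receiver_counts = count_matches(
    {"Christ-Wackernagel, Paul"},
    normalize(["Herr Christ", "i/Fa. Paravicini, Christ & Co."], ALIASES),
)
print(sender_counts)    # (0, 1, 2)
print(receiver_counts)  # (1, 1, 0)
```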

 

Scoring the collection

With scores for each letter in place, we can calculate the overall performance of an LLM on the collection. We calculate F1-micro and F1-macro:

  • F1 is the harmonic mean of precision and recall, where precision is TP / (TP + FP) and recall is TP / (TP + FN).
  • F1-micro is the F1 score computed from the TP, FP and FN counts summed across all categories.
  • F1-macro is the unweighted average of the per-category F1 scores.
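As a worked example, the sketch below computes both scores from the TP/FP/FN counts of the example letter. The function names are illustrative, and returning 0.0 when a denominator is zero is an assumption rather than documented benchmark behaviour.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 if undefined (assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_micro(counts: Dict[str, Counts]) -> float:
    """Sum the counts over all categories, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_score(tp, fp, fn)


def f1_macro(counts: Dict[str, Counts]) -> float:
    """Compute F1 per category, then take the unweighted mean."""
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


example = {
    "send_date": (1, 0, 0),
    "sender_persons": (0, 1, 2),
    "receiver_persons": (1, 1, 0),
}
print(round(f1_micro(example), 4))  # 0.5
print(round(f1_macro(example), 4))  # 0.5556
```

For this single letter the pooled counts are TP = 2, FP = 2, FN = 2, so F1-micro is 0.5. The per-category F1 scores are 1, 0 and 2/3, so F1-macro is 5/9.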

 

Rule parameters

  • inferred_from_function: If true, the person is inferred from their function and the date (e.g., a letter from the Basler Personenschifffahrtsgesellschaft in 1925 signed by "der Präsident" was penned by Max Vischer-von Planta).
  • inferred_from_correspondence: If true, the person is inferred from the correspondence history (e.g., "referring to your letter from last week").
  • skip_signatures: If true, then letters with signatures are not scored.
  • skip_non_signatures: If true, then letters without signatures are not scored.
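The two skip parameters amount to a filter on the `has_signatures` field. A minimal sketch, assuming both default to false; `should_score` is a hypothetical helper name:

```python
def should_score(
    has_signatures: bool,
    skip_signatures: bool = False,
    skip_non_signatures: bool = False,
) -> bool:
    """Decide whether a letter takes part in scoring under the skip rules."""
    if has_signatures and skip_signatures:
        return False  # letters with signatures are excluded
    if not has_signatures and skip_non_signatures:
        return False  # letters without signatures are excluded
    return True


print(should_score(True, skip_signatures=True))       # False
print(should_score(False, skip_signatures=True))      # True
print(should_score(False, skip_non_signatures=True))  # False
```

Setting both parameters to true would exclude every letter, so a caller may want to reject that combination.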

Observations

TODOs

Add more fields to the metadata schema, namely `sender_organization` (inscribed and normalized), `receiver_organization` (inscribed and normalized), and fields for entities mentioned (persons, places, organizations, ships; inscribed and normalized).