This benchmark evaluates the performance of large language models on extracting structured company information from historical index cards. The benchmark consists of 34 pages from an archival record preserved at the Swiss Federal Archives, which documents Swiss companies blacklisted by the British authorities during the Second World War. The complete record comprises over 1000 index cards, each following a consistent structure divided into three sections: the company name, an identification number, and two dates. The first date indicates when the company was added to the blacklist, and the second when it was removed.
| Data Type | Images |
|---|---|
| Amount | 33 cards |
| Origin | https://www.recherche.bar.admin.ch/recherche/#/de/archiv/einheit/31240458 |
| Signature | E2001E#1968/78#7508* |
| Language | German |
| Content Description | Index cards for companies that were on a British trade blacklist (1940s) |
| Time Period | 1940s |
| License | Free use |
| Tags | |
The ground truth was created by defining the dataclass structure and running the benchmark. Each item was then carefully reviewed and corrected. Commas after names and periods after locations have been omitted.
Example ground truth file with all possible values present:

    {
      "b_id": {
        "transcription": "B.51.322.GB.638"
      },
      "company": {
        "transcription": "ARIA, Automobil-Reifen-Import A.G."
      },
      "date": "1942-08-06",
      "information": [
        {
          "transcription": "Britische schwarze Liste\nAmendment 12, Juli 1942."
        },
        {
          "transcription": "gestrichen:\nAmendment 13, 23. November 1945."
        }
      ],
      "location": {
        "transcription": "Zürich"
      }
    }

This dataclass defines the desired output structure, which is the same as the structure of the ground truth. Only company names, IDs, and locations are always present and thus required. The remaining fields are optional.

    from typing import List, Optional

    from pydantic import BaseModel

    class Entry(BaseModel):
        transcription: str

    class Company(BaseModel):
        transcription: str

    class BID(BaseModel):
        transcription: str

    class Location(BaseModel):
        transcription: str

    # One index card; mirrors the ground truth JSON structure.
    class Card(BaseModel):
        company: Company
        location: Location
        b_id: BID
        # Optional fields: not present on every card.
        date: Optional[str] = None
        information: Optional[List[Entry]] = None

As both the ground truth and the LLM response are JSON, scoring is simple. For each request, each ground truth field is fuzzy-matched to the corresponding response field. The average of those values constitutes the request score. The average of all request scores is the test score, which ranges from 0 to 1 and is reported normalized to 0-100.
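The scoring described above can be sketched as follows. This is only an illustration under stated assumptions: the benchmark's actual fuzzy-matching function is not specified here, so the sketch substitutes `difflib.SequenceMatcher` from the Python standard library, and the helper names `flatten`, `request_score`, and `benchmark_score` are hypothetical. It also assumes that fields are paired by their position in the JSON structure and that a field missing from the response scores 0.

```python
from difflib import SequenceMatcher
from statistics import mean


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into a {path: leaf-as-string} mapping."""
    if isinstance(value, dict):
        items = ((f"{prefix}.{k}" if prefix else k, v) for k, v in value.items())
    elif isinstance(value, list):
        items = ((f"{prefix}[{i}]", v) for i, v in enumerate(value))
    else:
        # Leaf value: None is treated as an empty string.
        return {prefix: "" if value is None else str(value)}
    flat = {}
    for path, child in items:
        flat.update(flatten(child, path))
    return flat


def request_score(truth, response):
    """Average similarity (0-1) over all ground truth fields of one card."""
    truth_flat = flatten(truth)
    response_flat = flatten(response)
    if not truth_flat:
        return 0.0
    ratios = [
        # Assumption: SequenceMatcher stands in for the real fuzzy matcher.
        SequenceMatcher(None, expected, response_flat.get(path, "")).ratio()
        for path, expected in truth_flat.items()
    ]
    return mean(ratios)


def benchmark_score(pairs):
    """Average request score over (truth, response) pairs, scaled to 0-100."""
    return 100 * mean(request_score(t, r) for t, r in pairs)
```

Pairing fields by path keeps the comparison deterministic, but it penalizes a response that lists the `information` entries in a different order. Whether the real scorer tolerates reordering is not stated in this document.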
The task required the extraction and differentiation of key structured information from each index card. Specifically, the model was expected to identify and correctly separate the company name and the location, which appear together in one region, to extract the identification number, and to transcribe the two information lines containing the dates indicating when the company was added to and removed from the blacklist. These two information lines were captured as transcribed text, without further normalization or structuring of the dates.
Some cards also contained handwritten annotations, which were excluded from the benchmark as they were infrequent, inconsistent, and not relevant to the extraction task. In most cases, these notes merely referred to supplementary materials.
Overall, the models performed well on the task, demonstrating that large language models are capable of reliably extracting and distinguishing the main textual elements. The transcribed date fields, which have not yet been standardized, can be normalized in a subsequent processing step, enabling their integration into structured dates for further analysis.
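One possible form of that subsequent normalization step is sketched below. It is not part of the benchmark: the function name `normalize_date`, the month table, and the regular expression are assumptions made for illustration, based only on the two example lines shown in the ground truth file. Lines without a day, such as "Juli 1942", are returned with month precision rather than with a guessed day.

```python
import re

# German month names mapped to month numbers (assumed spelling on the cards).
GERMAN_MONTHS = {
    "januar": 1, "februar": 2, "märz": 3, "april": 4,
    "mai": 5, "juni": 6, "juli": 7, "august": 8,
    "september": 9, "oktober": 10, "november": 11, "dezember": 12,
}

# Optional day ("23."), then a month name, then a four-digit year.
DATE_PATTERN = re.compile(
    r"(?:(\d{1,2})\.\s*)?(" + "|".join(GERMAN_MONTHS) + r")\s+(\d{4})",
    re.IGNORECASE,
)


def normalize_date(line):
    """Return an ISO-style date string for the first date found, or None."""
    match = DATE_PATTERN.search(line)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = GERMAN_MONTHS[month_name.lower()]
    if day is None:
        # No day on the card: keep month precision only.
        return f"{year}-{month:02d}"
    return f"{year}-{month:02d}-{int(day):02d}"


print(normalize_date("gestrichen:\nAmendment 13, 23. November 1945."))
print(normalize_date("Britische schwarze Liste\nAmendment 12, Juli 1942."))
```

The amendment number is followed by a comma rather than a period, so the pattern does not mistake it for a day. The sketch does not check that the day is valid for the month, and cards that use other date formats would need additional patterns.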