RISE Humanities Data Benchmark, 0.4.0

Company Lists

Overview

This benchmark evaluates the performance of large language models on extracting structured company information (company name and location) from historical trade indexes. The benchmark consists of 16 pages from trade indexes, issued between 1921 and 1958 by the British Chamber of Commerce for Switzerland and preserved at the Swiss Economic Archives in Basel. The complete record comprises 14 individual trade indexes. These sources consist of lists of chamber members, accompanied by additional details such as addresses, locations, and, in later years, telephone numbers and telegraphic codes. Each volume contains two distinct types of lists: first, an alphabetical directory giving the company name and the page reference within the index, and second, the actual trade index, which provides full entries containing the relevant commercial information. The layout of these indexes varies considerably over time, ranging from multi-column tables to more linear list formats.

The objective of this benchmark was to extract those data fields that occur consistently across all entries, namely the company name and the location. The task required the model to identify these elements across differing typographic layouts and information densities, and to classify them into a structured format suitable for further processing.

Dataset Description

Data Type: Images (JPG, 1868x2931, ~350 KB each)
Amount: 15 images
Origin: https://doi.org/10.7891/e-manuscripta-174832
Signature: UBW Zo 1394
Language: English, (German)
Content Description: Collection of trade indexes from the British Chamber of Commerce for Switzerland
Time Period: 1921 - 1958
License: Public Domain Mark, PDM 1.0
Tags: book-pages, list-like, printed-source, century-20th, company-entries

Role: Contributors
Domain expert: lea_kasper
Data curator: lea_kasper
Annotator: lea_kasper, sorin_marti
Analyst: lea_kasper
Engineer: sorin_marti

Image

Ground Truth

{
   "page_id": "156089_1321092_18",
   "entries": [
       {
           "entry_id": "156089_1321092_18-1",
           "company_name": "Färbereien Schetty, A.-G.",
           "location": "Basle"
       },
       {
           "entry_id": "156089_1321092_18-2",
           "company_name": "G. Kappeler A.-G.",
           "location": "Zofingen"
       },
       {
           "entry_id": "156089_1321092_18-3",
           "company_name": "J. A. Crabtree & Co., Ltd.",
           "location": "Walsall, England"
       }, ...

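from typing import List, Optional

from pydantic import BaseModel


class TypeValuePair(BaseModel):
    # Generic type/value pair for an extra attribute of an entry
    type: str
    value: str


class Entry(BaseModel):
    entry_id: str
    company_name: str
    location: str
    # Optional extra attributes; defaults to None when absent
    additional_information: Optional[List[TypeValuePair]] = None


class ListPage(BaseModel):
    # One page of the trade index with all of its company entries
    page_id: str
    entries: List[Entry]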
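
Before scoring, it can be useful to check that a response has the shape the models above describe. The following is a minimal, dependency-free sketch using only the standard library; the function name `find_shape_problems` and the wording of the messages are hypothetical and not part of the benchmark code.

```python
# Hypothetical helper: checks a parsed JSON page against the shape of
# ListPage / Entry without requiring Pydantic.
REQUIRED_ENTRY_FIELDS = ("entry_id", "company_name", "location")


def find_shape_problems(page):
    """Return a list of problem descriptions; an empty list means the shape is fine."""
    if not isinstance(page, dict):
        return ["page is not an object"]

    problems = []
    if not isinstance(page.get("page_id"), str):
        problems.append("page_id missing or not a string")

    entries = page.get("entries")
    if not isinstance(entries, list):
        problems.append("entries missing or not a list")
        return problems

    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entries[{i}] is not an object")
            continue
        for field in REQUIRED_ENTRY_FIELDS:
            if not isinstance(entry.get(field), str):
                problems.append(f"entries[{i}].{field} missing or not a string")
        # additional_information is optional, but must be a list when present
        extra = entry.get("additional_information")
        if extra is not None and not isinstance(extra, list):
            problems.append(f"entries[{i}].additional_information is not a list")
    return problems
```

In practice, a response string would first be parsed with `json.loads` and the resulting dictionary passed to this function; validating with the Pydantic models directly is the more complete option when Pydantic is available.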

Scoring

Because both the ground truth and the LLM response are JSON, scoring is straightforward. For each request, every ground-truth field is fuzzy-matched against the corresponding response field. The average of these match values is the request score, and the average of all request scores is the test score (0-1, normalized to 0-100).
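
The two-level averaging can be illustrated with a short sketch. It rests on several assumptions that the description does not confirm: similarity is computed with `difflib.SequenceMatcher` from the standard library (the benchmark may use a different fuzzy matcher), entries are aligned by position, only `company_name` and `location` are compared, and a missing entry or field scores 0. The function names are hypothetical.

```python
from difflib import SequenceMatcher

# Assumption: only these two fields are scored.
SCORED_FIELDS = ("company_name", "location")


def fuzzy(a, b):
    """Similarity in [0, 1]; stand-in for whatever fuzzy matcher the benchmark uses."""
    return SequenceMatcher(None, a, b).ratio()


def request_score(truth, response):
    """Average field similarity for one page (one request)."""
    response_entries = response.get("entries") or []
    scores = []
    for i, truth_entry in enumerate(truth["entries"]):
        # Assumption: entries are aligned by position; a missing entry scores 0.
        resp_entry = response_entries[i] if i < len(response_entries) else {}
        for field in SCORED_FIELDS:
            scores.append(fuzzy(truth_entry[field], resp_entry.get(field) or ""))
    return sum(scores) / len(scores) if scores else 0.0


def benchmark_score(pairs):
    """Average of request scores over (truth, response) pairs, in [0, 1]."""
    request_scores = [request_score(t, r) for t, r in pairs]
    return sum(request_scores) / len(request_scores) if request_scores else 0.0
```

Multiplying the result of `benchmark_score` by 100 gives the normalized 0-100 value. Positional alignment is the simplest choice for a sketch; a response that skips one entry would shift every later comparison, so a real implementation might align entries differently.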

Observations

Overall, the extraction process performed consistently well. The model proved capable of adapting to heterogeneous layouts and isolating the relevant entities with a high degree of accuracy. The extraction of additional attributes (such as addresses, telephone numbers, or product descriptions) was intentionally omitted, as these elements are not uniformly available across all records. Moreover, for the research objectives of this project, the company name and location are of primary analytical importance: they provide the basis for georeferencing the firms in a subsequent step and thereby enable geospatial analysis of the chamber's membership over time.