Tests models on extracting structured metadata from historical correspondence: person names, organizations, dates, locations, and other contextual information in 20th-century Swiss letters.
This benchmark has been run 421 times. It uses the f1_macro metric.
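As a rough illustration of what a macro-averaged F1 score measures, the sketch below computes an F1 per metadata field and then takes the unweighted mean across fields. The field names, the exact-match comparison, and the convention that an empty field scores 0.0 are all assumptions made for this example; the benchmark's actual scoring code may match values, group categories, or scale results differently.

```python
from typing import Dict, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    # Assumed convention: no predictions and no gold values scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(predicted: Dict[str, Set[str]], expected: Dict[str, Set[str]]) -> float:
    """Unweighted mean of per-field F1 scores (hypothetical helper)."""
    fields = sorted(set(predicted) | set(expected))
    scores = []
    for field in fields:
        pred = set(predicted.get(field, ()))
        gold = set(expected.get(field, ()))
        tp = len(pred & gold)   # values found and correct (exact match)
        fp = len(pred - gold)   # values found but not in the gold data
        fn = len(gold - pred)   # gold values the model missed
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical letter metadata, not taken from the dataset.
expected = {"persons": {"A. Muster", "B. Beispiel"}, "locations": {"Basel"}}
predicted = {"persons": {"A. Muster"}, "locations": {"Basel", "Bern"}}
score = macro_f1(predicted, expected)
```

Because every field contributes equally to the mean, a field with few values (such as a single location) weighs as much as one with many values. This is the main way macro averaging differs from micro averaging.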
Tested providers: x-ai, openrouter, genai, openai, anthropic, mistral, alibaba, scicore
Tested models: claude-opus-4-6, qwen3.5-35b-a3b, gpt-5-nano, gemini-2.0-flash-lite, qwen/qwen3.5-flash-02-23, meta-llama/llama-4-maverick, gpt-4o-mini, qwen/qwen3-vl-8b-instruct, gpt-5.4-2026-03-05, gemini-2.5-pro, qwen/qwen3-vl-8b-thinking, gpt-5-mini, gpt-4.1, magistral-small-2509, qwen/qwen3.5-35b-a3b, qwen/qwen3.5-397b-a17b, gemini-exp-1206, gpt-4.1-nano, google/gemma-4-26b-a4b-it, qwen/qwen3.5-122b-a10b, mistral-small-2506, claude-opus-4-20250514, gemini-2.5-flash-lite, mistral-medium-2508, claude-opus-4-5-20251101, gpt-5.5-2026-04-23, gemini-3.1-pro-preview, gpt-5, gemini-2.0-pro-exp-02-05, qwen/qwen3.5-9b, qwen/qwen3-vl-30b-a3b-instruct, claude-3-5-sonnet-20241022, gemini-3-flash-preview, google/gemma-4-31b-it, x-ai/grok-4, gpt-5.1-2025-11-13, claude-haiku-4-5-20251001, gemini-2.0-flash, claude-sonnet-4-5-20250929, grok-4.20-0309-reasoning, pixtral-12b, claude-opus-4-1-20250805, qwen/qwen3.5-27b, gemini-3.1-flash-lite-preview, gpt-4.5-preview, gemini-2.5-flash, qwen3.5-plus-2026-02-15, gpt-4o, GLM-4.5V-FP8, qwen3.5-397b-a17b, mistral-large-2411, claude-3-7-sonnet-20250219, mistral-medium-2505, o3, gemini-2.5-flash-preview-09-2025, ministral-14b-2512, gemini-2.5-pro-exp-03-25, qwen3.5-122b-a10b, gemini-2.5-flash-lite-preview-09-2025, gemini-1.5-flash, magistral-medium-2509, gpt-5.2-2025-12-11, qwen3.5-27b, claude-sonnet-4-20250514, claude-opus-4-7, qwen3.5-flash-2026-02-23, gpt-4.1-mini, claude-sonnet-4-6, qwen/qwen3.6-plus, pixtral-large-2411, mistral-large-2512, gemini-1.5-pro, ministral-8b-2512, gpt-5.3-codex, gemini-3-pro-preview, claude-3-opus-20240229, qwen/qwen3.5-plus-02-15
| Score | Date | Provider | Model |
|---|---|---|---|
| 61.00 | 3 weeks ago | openai | gpt-5.5-2026-04-23 |
| 59.00 | 3 weeks ago | openai | gpt-5.5-2026-04-23 |
| 54.00 | 3 weeks ago | openai | gpt-5.5-2026-04-23 |
| 55.00 | 1 month ago | openrouter | qwen/qwen3.5-9b |
| 51.00 | 1 month ago | openrouter | qwen/qwen3.5-9b |

| Role | Contributors |
|---|---|
| Domain expert | Eric Decker, Maximilian Hindermann, Lea Kasper |
| Data curator | Anthea Alberto, Eric Decker, Maximilian Hindermann |
| Annotator | Anthea Alberto, Eric Decker, Pema Frick, Maximilian Hindermann, Lea Kasper, José Luis Losada Palenzuela, Sorin Marti, Elena Spadini |
| Analyst | Maximilian Hindermann |
| Engineer | Maximilian Hindermann |