Tests models on extracting structured metadata (person names, organizations, dates, locations, and other contextual information) from 20th-century Swiss historical letters.
This benchmark has been run 382 times. It is scored with the f1_macro metric.
Tested providers: mistral, openai, openrouter, x-ai, alibaba, genai, scicore, anthropic
Tested models: ministral-14b-2512, qwen3.5-397b-a17b, gemini-2.0-pro-exp-02-05, pixtral-large-2411, claude-opus-4-5-20251101, o3, gpt-4.1-mini, claude-sonnet-4-6, x-ai/grok-4, magistral-small-2509, qwen/qwen3-vl-8b-thinking, qwen3.5-35b-a3b, qwen/qwen3-vl-8b-instruct, qwen/qwen3-vl-30b-a3b-instruct, gemini-1.5-pro, pixtral-12b, gemini-2.5-pro-exp-03-25, gemini-2.0-flash, claude-opus-4-1-20250805, qwen3.5-flash-2026-02-23, gemini-2.5-flash, gpt-5, magistral-medium-2509, gpt-5-mini, gemini-3.1-flash-lite-preview, gemini-1.5-flash, ministral-8b-2512, gemini-3-flash-preview, claude-3-opus-20240229, gemini-2.5-flash-preview-09-2025, mistral-small-2506, gpt-4.1, gpt-4o-mini, grok-4.20-0309-reasoning, claude-3-5-sonnet-20241022, gpt-4.5-preview, claude-opus-4-20250514, gemini-2.5-pro, GLM-4.5V-FP8, mistral-large-2512, gpt-5.3-codex, claude-sonnet-4-20250514, qwen3.5-plus-2026-02-15, qwen3.5-27b, claude-3-7-sonnet-20250219, qwen3.5-122b-a10b, gpt-4o, gemini-3.1-pro-preview, gpt-4.1-nano, meta-llama/llama-4-maverick, gemini-exp-1206, gemini-2.5-flash-lite, gpt-5-nano, gemini-2.5-flash-lite-preview-09-2025, claude-haiku-4-5-20251001, mistral-medium-2505, gemini-3-pro-preview, mistral-large-2411, gpt-5.2-2025-12-11, gpt-5.4-2026-03-05, mistral-medium-2508, claude-opus-4-6, gemini-2.0-flash-lite, gpt-5.1-2025-11-13, claude-sonnet-4-5-20250929
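To make the f1_macro metric concrete, the following is a minimal sketch of how a macro-averaged F1 score can be computed for one letter. It assumes exact string matching and one set of values per metadata category. The category names, the sample values, and the helper functions `f1` and `f1_macro` are hypothetical illustrations, not the benchmark's own scoring code, which may normalize or fuzzy-match values differently.

```python
from typing import Dict, Set


def f1(predicted: Set[str], expected: Set[str]) -> float:
    """F1 for a single category, using exact string matching (an assumption)."""
    if not predicted and not expected:
        # Nothing to find and nothing predicted: treated as perfect in this sketch.
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def f1_macro(predicted: Dict[str, Set[str]], expected: Dict[str, Set[str]]) -> float:
    """Unweighted mean of the per-category F1 scores over the expected categories."""
    categories = sorted(expected)
    if not categories:
        return 0.0
    scores = [f1(predicted.get(c, set()), expected[c]) for c in categories]
    return sum(scores) / len(scores)


# Hypothetical ground truth and model output for one letter.
expected = {
    "persons": {"A. Example", "B. Sample"},
    "organizations": {"Example Society"},
    "locations": {"Basel"},
}
predicted = {
    "persons": {"A. Example"},            # one of two persons found
    "organizations": {"Example Society"}, # exact match
    "locations": {"Zürich"},              # wrong location
}

score = f1_macro(predicted, expected)
print(f"{score:.4f}")
```

Because every category carries the same weight, a complete miss in a small category such as locations lowers the score as much as a miss in a large one. This is the property that distinguishes macro averaging from micro averaging.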
| Score | Date | Provider | Model |
|---|---|---|---|
| 0.00 | 5 days ago | mistral | ministral-8b-2512 |
| 0.00 | 5 days ago | mistral | ministral-8b-2512 |
| 57.00 | 1 week ago | alibaba | qwen3.5-flash-2026-02-23 |
| 55.00 | 1 week ago | alibaba | qwen3.5-35b-a3b |
| 58.00 | 1 week ago | alibaba | qwen3.5-397b-a17b |
| Role | Contributors |
|---|---|
| Domain expert | Eric Decker, Maximilian Hindermann, Lea Kasper |
| Data curator | Anthea Alberto, Eric Decker, Maximilian Hindermann |
| Annotator | Anthea Alberto, Eric Decker, Pema Frick, Maximilian Hindermann, Lea Kasper, José Luis Losada Palenzuela, Sorin Marti, Elena Spadini |
| Analyst | Maximilian Hindermann |
| Engineer | Maximilian Hindermann |