A benchmark for catalog card analysis and information extraction from historical library catalog systems. It evaluates models on structured data extraction from digitized catalog cards, testing their ability to parse complex bibliographic information, author names, dates, and hierarchical catalog structures in historical Swiss library records.
This benchmark has been run 117 times. It uses the f1_macro metric.
Tested providers: mistral, openrouter, x-ai, alibaba, genai, anthropic, scicore, openai
Tested models: ministral-14b-2512, qwen3.5-397b-a17b, pixtral-large-2411, claude-opus-4-5-20251101, o3, gpt-4.1-mini, claude-sonnet-4-6, x-ai/grok-4, magistral-small-2509, qwen/qwen3-vl-8b-thinking, qwen3.5-35b-a3b, qwen/qwen3-vl-8b-instruct, qwen/qwen3-vl-30b-a3b-instruct, pixtral-12b, gemini-2.0-flash, claude-opus-4-1-20250805, qwen3.5-flash-2026-02-23, gemini-2.5-flash, gpt-5, magistral-medium-2509, gpt-5-mini, gemini-3.1-flash-lite-preview, ministral-8b-2512, gemini-3-flash-preview, claude-3-opus-20240229, gemini-2.5-flash-preview-09-2025, mistral-small-2506, gpt-4.1, gpt-4o-mini, grok-4.20-0309-reasoning, claude-3-5-sonnet-20241022, claude-opus-4-20250514, gemini-2.5-pro, mistral-large-2512, GLM-4.5V-FP8, gpt-5.3-codex, qwen3.5-27b, claude-sonnet-4-20250514, qwen3.5-plus-2026-02-15, qwen3.5-122b-a10b, claude-3-7-sonnet-20250219, gpt-4o, gemini-3.1-pro-preview, gpt-4.1-nano, meta-llama/llama-4-maverick, gemini-2.5-flash-lite, gpt-5-nano, gemini-2.5-flash-lite-preview-09-2025, claude-haiku-4-5-20251001, mistral-medium-2505, gemini-3-pro-preview, mistral-large-2411, gpt-5.2-2025-12-11, gpt-5.4-2026-03-05, mistral-medium-2508, claude-opus-4-6, gemini-2.0-flash-lite, gpt-5.1-2025-11-13, claude-sonnet-4-5-20250929
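
To make the f1_macro metric concrete, here is a minimal sketch of macro-averaged F1: an F1 score is computed for each category separately, and the unweighted mean of those scores is reported. How this benchmark derives true and false positives from the extracted card fields is not stated here, so the sketch assumes per-category counts are already available. The `Counts` class, the category names, and all numbers are hypothetical. The scores in the table below appear to be on a 0 to 100 scale, which would correspond to multiplying the result by 100.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical per-category tallies; not taken from the benchmark."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def f1(c: Counts) -> float:
    """F1 for one category: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def f1_macro(per_category: dict[str, Counts]) -> float:
    """Unweighted mean of the per-category F1 scores."""
    if not per_category:
        return 0.0
    return sum(f1(c) for c in per_category.values()) / len(per_category)


# Made-up counts for three illustrative categories.
example = {
    "author": Counts(tp=8, fp=2, fn=2),  # F1 = 16 / 20
    "title": Counts(tp=5, fp=0, fn=5),   # F1 = 10 / 15
    "date": Counts(tp=0, fp=3, fn=1),    # F1 = 0
}
print(round(f1_macro(example), 4))
```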
| Score | Date | Provider | Model |
|---|---|---|---|
| 38.61 | 1 week ago | alibaba | qwen3.5-flash-2026-02-23 |
| 86.85 | 1 week ago | alibaba | qwen3.5-27b |
| 83.80 | 1 week ago | alibaba | qwen3.5-35b-a3b |
| 85.29 | 1 week ago | alibaba | qwen3.5-122b-a10b |
| 88.25 | 1 week ago | alibaba | qwen3.5-397b-a17b |

| Role | Contributors |
|---|---|
| Domain expert | Gabriel Müller |
| Data curator | Gabriel Müller |
| Annotator | Maximilian Hindermann, Gabriel Müller |
| Analyst | Maximilian Hindermann |
| Engineer | Maximilian Hindermann |