Evaluates models on page segmentation and handwritten text extraction from 15th-century manuscripts written in late medieval German. Tests the ability to transcribe historical handwriting, identify folio numbers, distinguish main text from marginal additions, and preserve historical spelling and formatting. Performance is measured using fuzzy string matching and Character Error Rate (CER).
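The two measures named above can be illustrated with a minimal Python sketch. This is not the benchmark's own scoring code, which is not shown here. It assumes that CER is the Levenshtein edit distance divided by the reference length, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever fuzzy matcher the benchmark actually uses. The function names `levenshtein`, `cer`, and `fuzzy_score` are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length (assumed definition)."""
    if not reference:
        # Convention chosen for this sketch: an empty reference is only
        # matched perfectly by an empty hypothesis.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def fuzzy_score(reference: str, hypothesis: str) -> float:
    """Similarity in [0, 1] from difflib; a stand-in, not the benchmark's matcher."""
    return SequenceMatcher(None, reference, hypothesis).ratio()


if __name__ == "__main__":
    # Illustrative strings only, not taken from the dataset.
    ground_truth = "vnd der herre sprach"
    model_output = "und der herre sprach"
    print(cer(ground_truth, model_output))
    print(fuzzy_score(ground_truth, model_output))
```

In this illustrative pair, a model that modernizes the historical spelling "vnd" to "und" changes one character out of 20, so the sketch reports a CER of 0.05. This is why preserving historical spelling matters for the score. Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.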
This benchmark has been run 105 times. It uses the CER metric.
Tested providers: openai, anthropic, x-ai, genai, mistral, openrouter, scicore, alibaba
Tested models: gemini-2.5-pro, qwen3.5-122b-a10b, mistral-small-2506, google/gemma-4-31b-it, qwen3.5-397b-a17b, gpt-5.1-2025-11-13, pixtral-12b, gpt-5.3-codex, o3, claude-opus-4-5-20251101, gemini-3-pro-preview, grok-4.20-0309-reasoning, qwen3.5-plus-2026-02-15, gemini-2.5-flash-lite-preview-09-2025, claude-3-7-sonnet-20250219, mistral-large-2512, qwen/qwen3.5-397b-a17b, magistral-medium-2509, gemini-3.1-pro-preview, magistral-small-2509, gemini-3-flash-preview, gemini-2.0-flash, gemini-2.5-flash-lite, qwen/qwen3.5-122b-a10b, gpt-5.2-2025-12-11, claude-opus-4-7, qwen3.5-flash-2026-02-23, qwen/qwen3-vl-8b-instruct, gpt-4.1, claude-sonnet-4-5-20250929, claude-opus-4-6, qwen3.5-35b-a3b, mistral-medium-2508, ministral-14b-2512, gpt-4o-mini, claude-haiku-4-5-20251001, claude-3-opus-20240229, gpt-5-mini, ministral-8b-2512, claude-sonnet-4-20250514, gemini-2.0-flash-lite, qwen/qwen3-vl-30b-a3b-instruct, gpt-5-nano, meta-llama/llama-4-maverick, claude-opus-4-20250514, gpt-5, gemini-3.1-flash-lite-preview, qwen/qwen3.5-27b, gpt-4.1-nano, gemini-2.5-flash-preview-09-2025, claude-sonnet-4-6, gpt-4.1-mini, qwen/qwen3-vl-8b-thinking, claude-3-5-sonnet-20241022, gpt-5.4-2026-03-05, gpt-4o, qwen3.5-27b, qwen/qwen3.5-35b-a3b, pixtral-large-2411, google/gemma-4-26b-a4b-it, qwen/qwen3.5-9b, qwen/qwen3.5-plus-02-15, GLM-4.5V-FP8, qwen/qwen3.5-flash-02-23, mistral-large-2411, mistral-medium-2505, claude-opus-4-1-20250805, qwen/qwen3.6-plus, gpt-5.5-2026-04-23, gemini-2.5-flash, x-ai/grok-4
| Score | Date | Provider | Model |
|---|---|---|---|
| 71.10 | 3 weeks ago | openai | gpt-5.5-2026-04-23 |
| 62.30 | 3 weeks ago | openrouter | qwen/qwen3.5-9b |
| 71.70 | 4 weeks ago | openrouter | qwen/qwen3.5-397b-a17b |
| 73.90 | 4 weeks ago | openrouter | google/gemma-4-31b-it |
| 75.40 | 4 weeks ago | openrouter | qwen/qwen3.5-122b-a10b |

| Role | Contributors |
|---|---|
| Domain expert | Ina Serif |
| Data curator | Ina Serif |
| Annotator | Ina Serif |
| Analyst | Maximilian Hindermann |
| Engineer | Maximilian Hindermann, Ina Serif |