RISE Humanities Data Benchmark, 0.5.1-pre1

Benchmark Results

Medieval Manuscripts

Evaluates models on page segmentation and handwritten text extraction from 15th-century manuscripts written in late medieval German. It tests the ability to transcribe historical handwriting, identify folio numbers, distinguish main text from marginal additions, and preserve historical spelling and formatting. Performance is measured using fuzzy string matching and Character Error Rate (CER).
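To make the two metrics concrete, here is a minimal sketch of how CER and a fuzzy match score are commonly computed. It assumes the standard definition of CER (Levenshtein edit distance divided by reference length) and uses Python's `difflib.SequenceMatcher` as a stand-in for fuzzy matching. The benchmark's own scoring code is not shown on this page, so the function names and the sample strings below are hypothetical and may differ from the actual implementation.

```python
from difflib import SequenceMatcher


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not ref:
        # Assumption: an empty reference counts as fully wrong unless
        # the hypothesis is empty as well.
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


def fuzzy_ratio(ref: str, hyp: str) -> float:
    """Similarity in [0, 1]; a stand-in for the benchmark's fuzzy matcher."""
    return SequenceMatcher(None, ref, hyp).ratio()


# Invented example: a model normalises historical "v" and "i" spellings.
reference = "vnd sprach zu im"
hypothesis = "und sprach zu jm"
print(cer(reference, hypothesis))          # 2 substitutions / 16 characters
print(fuzzy_ratio(reference, hypothesis))
```

Lower CER is better, while a higher fuzzy ratio is better. Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.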


This benchmark has been run 105 times. It uses the CER metric.

Overview

Tested providers: openai, anthropic, x-ai, genai, mistral, openrouter, scicore, alibaba

Tested models: gemini-2.5-pro, qwen3.5-122b-a10b, mistral-small-2506, google/gemma-4-31b-it, qwen3.5-397b-a17b, gpt-5.1-2025-11-13, pixtral-12b, gpt-5.3-codex, o3, claude-opus-4-5-20251101, gemini-3-pro-preview, grok-4.20-0309-reasoning, qwen3.5-plus-2026-02-15, gemini-2.5-flash-lite-preview-09-2025, claude-3-7-sonnet-20250219, mistral-large-2512, qwen/qwen3.5-397b-a17b, magistral-medium-2509, gemini-3.1-pro-preview, magistral-small-2509, gemini-3-flash-preview, gemini-2.0-flash, gemini-2.5-flash-lite, qwen/qwen3.5-122b-a10b, gpt-5.2-2025-12-11, claude-opus-4-7, qwen3.5-flash-2026-02-23, qwen/qwen3-vl-8b-instruct, gpt-4.1, claude-sonnet-4-5-20250929, claude-opus-4-6, qwen3.5-35b-a3b, mistral-medium-2508, ministral-14b-2512, gpt-4o-mini, claude-haiku-4-5-20251001, claude-3-opus-20240229, gpt-5-mini, ministral-8b-2512, claude-sonnet-4-20250514, gemini-2.0-flash-lite, qwen/qwen3-vl-30b-a3b-instruct, gpt-5-nano, meta-llama/llama-4-maverick, claude-opus-4-20250514, gpt-5, gemini-3.1-flash-lite-preview, qwen/qwen3.5-27b, gpt-4.1-nano, gemini-2.5-flash-preview-09-2025, claude-sonnet-4-6, gpt-4.1-mini, qwen/qwen3-vl-8b-thinking, claude-3-5-sonnet-20241022, gpt-5.4-2026-03-05, gpt-4o, qwen3.5-27b, qwen/qwen3.5-35b-a3b, pixtral-large-2411, google/gemma-4-26b-a4b-it, qwen/qwen3.5-9b, qwen/qwen3.5-plus-02-15, GLM-4.5V-FP8, qwen/qwen3.5-flash-02-23, mistral-large-2411, mistral-medium-2505, claude-opus-4-1-20250805, qwen/qwen3.6-plus, gpt-5.5-2026-04-23, gemini-2.5-flash, x-ai/grok-4

Last 5 Runs

Score  Date         Provider    Model
71.10  3 weeks ago  openai      gpt-5.5-2026-04-23
62.30  3 weeks ago  openrouter  qwen/qwen3.5-9b
71.70  4 weeks ago  openrouter  qwen/qwen3.5-397b-a17b
73.90  4 weeks ago  openrouter  google/gemma-4-31b-it
75.40  4 weeks ago  openrouter  qwen/qwen3.5-122b-a10b


Contributors

Role           Contributors
Domain expert  Ina Serif
Data curator   Ina Serif
Annotator      Ina Serif
Analyst        Maximilian Hindermann
Engineer       Maximilian Hindermann, Ina Serif

Tags
  • Type(s): manuscript
  • Benchmark task(s): transcription
  • Writing: handwritten
  • Source creation (century): 15
  • Source Layout: prose
  • Language(s): de