RISE Humanities Data Benchmark, 0.5.0-pre1

Benchmark Results

Personnel Cards

Evaluates models' ability to transcribe and interpret personnel index cards of Swiss federal employees (1941–1961), containing typed and handwritten entries on job title, work location, pay grade, salary, and related notes in German and French.


This benchmark has been run 59 times. It uses the f1_micro metric.
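For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged F1 score is generally computed: true positives, false positives, and false negatives are summed over all items before a single F1 value is derived. The function name `f1_micro`, the per-card count tuples, and the sample numbers are hypothetical and only illustrate the standard definition; the benchmark's actual field-matching rules are not described here.

```python
from typing import Iterable, Tuple


def f1_micro(counts: Iterable[Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-item (tp, fp, fn) counts.

    Counts are pooled across all items first, so items with more
    fields contribute more to the final score.
    """
    tp = fp = fn = 0
    for item_tp, item_fp, item_fn in counts:
        tp += item_tp
        fp += item_fp
        fn += item_fn
    denom = 2 * tp + fp + fn
    # Assumption: an empty evaluation is defined as 0.0 rather than an error.
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for three cards: (true pos., false pos., false neg.)
cards = [(8, 1, 1), (5, 0, 3), (10, 2, 0)]
score = f1_micro(cards)
print(f"{100 * score:.2f}")
```

The scores listed on this page appear to be on a 0 to 100 scale, which is why the sketch multiplies by 100 when printing.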

Overview

Tested providers: mistral, openai, openrouter, x-ai, alibaba, genai, cohere, anthropic

Tested models: ministral-14b-2512, qwen3.5-397b-a17b, pixtral-large-2411, claude-opus-4-5-20251101, o3, gpt-4.1-mini, claude-sonnet-4-6, x-ai/grok-4, magistral-small-2509, qwen/qwen3-vl-8b-thinking, qwen3.5-35b-a3b, qwen/qwen3-vl-8b-instruct, qwen/qwen3-vl-30b-a3b-instruct, gemini-2.0-flash, claude-opus-4-1-20250805, qwen3.5-flash-2026-02-23, gemini-2.5-flash, gpt-5, magistral-medium-2509, gpt-5-mini, gemini-3.1-flash-lite-preview, command-a-vision-07-2025, ministral-8b-2512, gemini-3-flash-preview, gemini-2.5-flash-preview-09-2025, mistral-small-2506, gpt-4.1, gpt-4o-mini, grok-4.20-0309-reasoning, gemini-2.5-pro, mistral-large-2512, claude-opus-4-20250514, gpt-5.3-codex, qwen3.5-27b, qwen3.5-122b-a10b, claude-sonnet-4-20250514, qwen3.5-plus-2026-02-15, gpt-4o, gemini-3.1-pro-preview, gpt-4.1-nano, meta-llama/llama-4-maverick, gemini-2.5-flash-lite, gemini-2.5-flash-lite-preview-09-2025, claude-haiku-4-5-20251001, gpt-5-nano, mistral-medium-2505, gemini-3-pro-preview, gpt-5.2-2025-12-11, mistral-large-2411, gpt-5.4-2026-03-05, claude-opus-4-6, mistral-medium-2508, gemini-2.0-flash-lite, gpt-5.1-2025-11-13, claude-sonnet-4-5-20250929

Last 5 Runs

Score  Date        Provider  Model
96.67  1 week ago  alibaba   qwen3.5-397b-a17b
85.55  1 week ago  alibaba   qwen3.5-flash-2026-02-23
96.51  1 week ago  alibaba   qwen3.5-35b-a3b
97.69  1 week ago  alibaba   qwen3.5-122b-a10b
96.96  1 week ago  alibaba   qwen3.5-27b


Contributors

Role           Contributors
Domain expert  tabea_wullschleger
Data curator   tabea_wullschleger
Annotator      tabea_wullschleger
Analyst        Maximilian Hindermann, tabea_wullschleger
Engineer       Maximilian Hindermann

Tags
  • Type(s): index-card
  • Benchmark task(s): transcription, document-understanding, data-correction
  • Writing: handwritten, typed, printed
  • Source creation (century): 20
  • Source Layout: table, form
  • Language(s): de, fr