RISE Humanities Data Benchmark, 0.5.1

Benchmark Results

Business Letters

Tests models on extracting structured metadata from historical correspondence: person names, organizations, dates, locations, and other contextual information in 20th-century Swiss letters.


This benchmark has been run 421 times. It is scored with the f1_macro metric.
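To make the metric concrete, the following is a minimal sketch of a macro-averaged F1 score. It assumes that macro averaging here means computing F1 separately per category (for example, per metadata field) and taking the unweighted mean. The page does not describe the benchmark's actual scoring code, so the function names, the field names, the sample values, and the handling of empty categories are all hypothetical.

```python
from typing import Dict, Set


def f1(predicted: Set[str], expected: Set[str]) -> float:
    """F1 for one category, computed from sets of predicted and expected values."""
    if not predicted and not expected:
        # Assumption: nothing expected and nothing predicted counts as a perfect match.
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def f1_macro(predictions: Dict[str, Set[str]], references: Dict[str, Set[str]]) -> float:
    """Unweighted mean of per-category F1 over every category seen on either side."""
    categories = sorted(set(predictions) | set(references))
    if not categories:
        return 0.0
    scores = [f1(predictions.get(c, set()), references.get(c, set())) for c in categories]
    return sum(scores) / len(scores)


# Hypothetical field names and values, for illustration only.
example_predictions = {
    "persons": {"A. Muster", "B. Beispiel"},
    "organizations": {"Muster AG"},
    "dates": {"1926-03-01"},
}
example_references = {
    "persons": {"A. Muster"},
    "organizations": {"Muster AG"},
    "dates": {"1926-03-02"},
}

example_score = f1_macro(example_predictions, example_references)
```

In this example the three categories score 2/3, 1 and 0, so the macro average is 5/9. Each category contributes equally regardless of how many values it contains, which is the defining property of macro averaging. The scores listed on this page look like they are on a 0 to 100 scale, so under that assumption the result would be multiplied by 100.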

Overview

Tested providers: x-ai, openrouter, genai, openai, anthropic, mistral, alibaba, scicore

Tested models: claude-opus-4-6, qwen3.5-35b-a3b, gpt-5-nano, gemini-2.0-flash-lite, qwen/qwen3.5-flash-02-23, meta-llama/llama-4-maverick, gpt-4o-mini, qwen/qwen3-vl-8b-instruct, gpt-5.4-2026-03-05, gemini-2.5-pro, qwen/qwen3-vl-8b-thinking, gpt-5-mini, gpt-4.1, magistral-small-2509, qwen/qwen3.5-35b-a3b, qwen/qwen3.5-397b-a17b, gemini-exp-1206, gpt-4.1-nano, google/gemma-4-26b-a4b-it, qwen/qwen3.5-122b-a10b, mistral-small-2506, claude-opus-4-20250514, gemini-2.5-flash-lite, mistral-medium-2508, claude-opus-4-5-20251101, gpt-5.5-2026-04-23, gemini-3.1-pro-preview, gpt-5, gemini-2.0-pro-exp-02-05, qwen/qwen3.5-9b, qwen/qwen3-vl-30b-a3b-instruct, claude-3-5-sonnet-20241022, gemini-3-flash-preview, google/gemma-4-31b-it, x-ai/grok-4, gpt-5.1-2025-11-13, claude-haiku-4-5-20251001, gemini-2.0-flash, claude-sonnet-4-5-20250929, grok-4.20-0309-reasoning, pixtral-12b, claude-opus-4-1-20250805, qwen/qwen3.5-27b, gemini-3.1-flash-lite-preview, gpt-4.5-preview, gemini-2.5-flash, qwen3.5-plus-2026-02-15, gpt-4o, GLM-4.5V-FP8, qwen3.5-397b-a17b, mistral-large-2411, claude-3-7-sonnet-20250219, mistral-medium-2505, o3, gemini-2.5-flash-preview-09-2025, ministral-14b-2512, gemini-2.5-pro-exp-03-25, qwen3.5-122b-a10b, gemini-2.5-flash-lite-preview-09-2025, gemini-1.5-flash, magistral-medium-2509, gpt-5.2-2025-12-11, qwen3.5-27b, claude-sonnet-4-20250514, claude-opus-4-7, qwen3.5-flash-2026-02-23, gpt-4.1-mini, claude-sonnet-4-6, qwen/qwen3.6-plus, pixtral-large-2411, mistral-large-2512, gemini-1.5-pro, ministral-8b-2512, gpt-5.3-codex, gemini-3-pro-preview, claude-3-opus-20240229, qwen/qwen3.5-plus-02-15

Last 5 Runs

Score   Date          Provider     Model
61.00   3 weeks ago   openai       gpt-5.5-2026-04-23
59.00   3 weeks ago   openai       gpt-5.5-2026-04-23
54.00   3 weeks ago   openai       gpt-5.5-2026-04-23
55.00   1 month ago   openrouter   qwen/qwen3.5-9b
51.00   1 month ago   openrouter   qwen/qwen3.5-9b


Contributors

Role            Contributors
Domain expert   Eric Decker, Maximilian Hindermann, Lea Kasper
Data curator    Anthea Alberto, Eric Decker, Maximilian Hindermann
Annotator       Anthea Alberto, Eric Decker, Pema Frick, Maximilian Hindermann, Lea Kasper, José Luis Losada Palenzuela, Sorin Marti, Elena Spadini
Analyst         Maximilian Hindermann
Engineer        Maximilian Hindermann

Tags
  • Type(s): letter
  • Benchmark task(s): information-extraction
  • Writing: typed, handwritten
  • Source creation (century): 20
  • Source Layout: prose
  • Language(s): de