RISE Humanities Data Benchmark, 0.5.0-pre1

Benchmark Results

Business Letters

Tests models on extracting structured metadata (person names, organizations, dates, locations, and other contextual information) from 20th-century Swiss historical correspondence.


This benchmark has been run 382 times. It uses the f1_macro metric.
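For readers unfamiliar with the metric, the following is a minimal sketch of a generic macro-averaged F1 computation: F1 is computed per class and the per-class values are averaged with equal weight. The function names, the class labels, and the counts in the example are illustrative assumptions. The page does not state how the benchmark maps extracted fields to classes or how it matches values, so this is not the benchmark's actual scoring code. The sketch returns a fraction between 0 and 1, whereas the scores listed on this page appear to be on a 0 to 100 scale.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    # Convention assumed here: a class with no predictions and no gold items scores 0.
    return 2 * tp / denom if denom else 0.0


def f1_macro(counts: dict[str, tuple[int, int, int]]) -> float:
    """Unweighted mean of per-class F1; counts maps class name -> (tp, fp, fn)."""
    if not counts:
        return 0.0
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Hypothetical counts for three metadata fields (not taken from the benchmark).
example = {
    "person": (8, 2, 2),        # F1 = 16 / 20 = 0.8
    "organization": (3, 1, 3),  # F1 = 6 / 10 = 0.6
    "date": (5, 0, 0),          # F1 = 10 / 10 = 1.0
}
score = f1_macro(example)  # mean of 0.8, 0.6, 1.0, i.e. about 0.8
```

Because every class carries the same weight, a rarely occurring field that a model extracts poorly lowers the macro score as much as a frequent one would.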

Overview

Tested providers: mistral, openai, openrouter, x-ai, alibaba, genai, scicore, anthropic

Tested models: ministral-14b-2512, qwen3.5-397b-a17b, gemini-2.0-pro-exp-02-05, pixtral-large-2411, claude-opus-4-5-20251101, o3, gpt-4.1-mini, claude-sonnet-4-6, x-ai/grok-4, magistral-small-2509, qwen/qwen3-vl-8b-thinking, qwen3.5-35b-a3b, qwen/qwen3-vl-8b-instruct, qwen/qwen3-vl-30b-a3b-instruct, gemini-1.5-pro, pixtral-12b, gemini-2.5-pro-exp-03-25, gemini-2.0-flash, claude-opus-4-1-20250805, qwen3.5-flash-2026-02-23, gemini-2.5-flash, gpt-5, magistral-medium-2509, gpt-5-mini, gemini-3.1-flash-lite-preview, gemini-1.5-flash, ministral-8b-2512, gemini-3-flash-preview, claude-3-opus-20240229, gemini-2.5-flash-preview-09-2025, mistral-small-2506, gpt-4.1, gpt-4o-mini, grok-4.20-0309-reasoning, claude-3-5-sonnet-20241022, gpt-4.5-preview, claude-opus-4-20250514, gemini-2.5-pro, GLM-4.5V-FP8, mistral-large-2512, gpt-5.3-codex, claude-sonnet-4-20250514, qwen3.5-plus-2026-02-15, qwen3.5-27b, claude-3-7-sonnet-20250219, qwen3.5-122b-a10b, gpt-4o, gemini-3.1-pro-preview, gpt-4.1-nano, meta-llama/llama-4-maverick, gemini-exp-1206, gemini-2.5-flash-lite, gpt-5-nano, gemini-2.5-flash-lite-preview-09-2025, claude-haiku-4-5-20251001, mistral-medium-2505, gemini-3-pro-preview, mistral-large-2411, gpt-5.2-2025-12-11, gpt-5.4-2026-03-05, mistral-medium-2508, claude-opus-4-6, gemini-2.0-flash-lite, gpt-5.1-2025-11-13, claude-sonnet-4-5-20250929

Last 5 Runs

Score   Date         Provider   Model
0.00    5 days ago   mistral    ministral-8b-2512
0.00    5 days ago   mistral    ministral-8b-2512
57.00   1 week ago   alibaba    qwen3.5-flash-2026-02-23
55.00   1 week ago   alibaba    qwen3.5-35b-a3b
58.00   1 week ago   alibaba    qwen3.5-397b-a17b


Contributors

Role            Contributors
Domain expert   Eric Decker, Maximilian Hindermann, Lea Kasper
Data curator    Anthea Alberto, Eric Decker, Maximilian Hindermann
Annotator       Anthea Alberto, Eric Decker, Pema Frick, Maximilian Hindermann, Lea Kasper, José Luis Losada Palenzuela, Sorin Marti, Elena Spadini
Analyst         Maximilian Hindermann
Engineer        Maximilian Hindermann

Tags
  • Type(s): letter
  • Benchmark task(s): information-extraction
  • Writing: typed, handwritten
  • Source creation (century): 20
  • Source Layout: prose
  • Language(s): de