RISE Humanities Data Benchmark, 0.5.0-pre1

Benchmark Results

Library Cards

A comprehensive benchmark focused on catalog card analysis and information extraction from historical library catalog systems. This benchmark evaluates models on structured data extraction from digitized catalog cards, testing their ability to parse complex bibliographic information, author names, dates, and hierarchical catalog structures from historical Swiss library records.


This benchmark has been run 117 times. It uses the f1_macro metric.
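The page names the metric but does not show how it is computed. The following is a minimal sketch, assuming the usual meaning of macro-averaged F1: compute an F1 score per unit (for example per card or per field), then take the unweighted mean. The function names and the example counts are hypothetical, and the benchmark's actual matching rules and aggregation may differ.

```python
from typing import Iterable, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    # With nothing expected and nothing predicted, F1 is undefined;
    # this sketch reports 0.0 in that case (an assumption).
    return 0.0 if denom == 0 else 2 * tp / denom


def f1_macro(counts: Iterable[Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-unit F1 scores."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in counts]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical (tp, fp, fn) counts for three units; not benchmark data.
example_counts = [(8, 1, 1), (5, 0, 5), (0, 2, 3)]
print(round(f1_macro(example_counts), 4))  # 0.5185
```

Because every unit contributes equally to the mean, one badly extracted unit lowers the result as much as any other, regardless of how many fields it contains. The scores listed on this page appear to be on a 0 to 100 scale, so a value from this sketch would presumably be multiplied by 100 for comparison.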

Overview

Tested providers: mistral, openrouter, x-ai, alibaba, genai, anthropic, scicore, openai

Tested models: ministral-14b-2512, qwen3.5-397b-a17b, pixtral-large-2411, claude-opus-4-5-20251101, o3, gpt-4.1-mini, claude-sonnet-4-6, x-ai/grok-4, magistral-small-2509, qwen/qwen3-vl-8b-thinking, qwen3.5-35b-a3b, qwen/qwen3-vl-8b-instruct, qwen/qwen3-vl-30b-a3b-instruct, pixtral-12b, gemini-2.0-flash, claude-opus-4-1-20250805, qwen3.5-flash-2026-02-23, gemini-2.5-flash, gpt-5, magistral-medium-2509, gpt-5-mini, gemini-3.1-flash-lite-preview, ministral-8b-2512, gemini-3-flash-preview, claude-3-opus-20240229, gemini-2.5-flash-preview-09-2025, mistral-small-2506, gpt-4.1, gpt-4o-mini, grok-4.20-0309-reasoning, claude-3-5-sonnet-20241022, claude-opus-4-20250514, gemini-2.5-pro, mistral-large-2512, GLM-4.5V-FP8, gpt-5.3-codex, qwen3.5-27b, claude-sonnet-4-20250514, qwen3.5-plus-2026-02-15, qwen3.5-122b-a10b, claude-3-7-sonnet-20250219, gpt-4o, gemini-3.1-pro-preview, gpt-4.1-nano, meta-llama/llama-4-maverick, gemini-2.5-flash-lite, gpt-5-nano, gemini-2.5-flash-lite-preview-09-2025, claude-haiku-4-5-20251001, mistral-medium-2505, gemini-3-pro-preview, mistral-large-2411, gpt-5.2-2025-12-11, gpt-5.4-2026-03-05, mistral-medium-2508, claude-opus-4-6, gemini-2.0-flash-lite, gpt-5.1-2025-11-13, claude-sonnet-4-5-20250929

Last 5 Runs

Score  Date        Provider  Model
38.61  1 week ago  alibaba   qwen3.5-flash-2026-02-23
86.85  1 week ago  alibaba   qwen3.5-27b
83.80  1 week ago  alibaba   qwen3.5-35b-a3b
85.29  1 week ago  alibaba   qwen3.5-122b-a10b
88.25  1 week ago  alibaba   qwen3.5-397b-a17b


Contributors

Role           Contributors
Domain expert  Gabriel Müller
Data curator   Gabriel Müller
Annotator      Maximilian Hindermann, Gabriel Müller
Analyst        Maximilian Hindermann
Engineer       Maximilian Hindermann

Tags
  • Type(s): index-card
  • Benchmark task(s): information-extraction
  • Writing: typed, printed, handwritten
  • Source creation (century): 20
  • Source Layout: index
  • Language(s): n/a