RISE Humanities Data Benchmark, 0.5.1

Benchmark Results

Book Advert XML files (malformed) from Avisblatt


This benchmark has been run 111 times. It uses a fuzzy metric.
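The page does not say how the fuzzy metric is computed. As a rough illustration only, the sketch below scores a predicted string against an expected one on a 0 to 100 scale. It assumes a character-level similarity ratio from Python's standard `difflib`; the function names `fuzzy_score` and `mean_score` are hypothetical and are not taken from the benchmark's code.

```python
from difflib import SequenceMatcher


def fuzzy_score(predicted: str, expected: str) -> float:
    """Return a similarity score in [0, 100] between two strings.

    Assumption: the benchmark's real metric may differ; this uses
    difflib's ratio purely as a stand-in for "fuzzy" matching.
    """
    # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
    return 100.0 * SequenceMatcher(None, predicted, expected).ratio()


def mean_score(pairs: list[tuple[str, str]]) -> float:
    """Average fuzzy_score over (predicted, expected) pairs; 0.0 if empty."""
    if not pairs:
        return 0.0
    return sum(fuzzy_score(p, e) for p, e in pairs) / len(pairs)
```

Under this sketch, a model output that differs from the reference by a few characters still earns partial credit rather than scoring zero, which is the usual reason for preferring a fuzzy metric over exact match in correction tasks.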

Overview

Tested providers: x-ai, openrouter, genai, openai, anthropic, deepseek, mistral, alibaba, scicore

Tested models: deepseek-v4-pro, claude-haiku-4-5-20251001, google/gemma-4-26b-a4b-it, qwen3.5-397b-a17b, qwen/qwen3.5-122b-a10b, gemini-2.0-flash, claude-opus-4-6, mistral-large-2411, qwen3.5-35b-a3b, claude-sonnet-4-5-20250929, mistral-small-2506, claude-3-7-sonnet-20250219, grok-4.20-0309-reasoning, gpt-5-nano, x-ai/grok-4, qwen3-235b-fp8, mistral-medium-2505, gemini-2.0-flash-lite, qwen/qwen3.5-flash-02-23, claude-opus-4-20250514, gemini-2.5-flash-lite, o3, meta-llama/llama-4-maverick, gpt-4o-mini, qwen/qwen3-vl-8b-instruct, gemini-2.5-flash-preview-09-2025, ministral-14b-2512, mistral-medium-2508, gemini-2.5-flash-lite-preview-09-2025, qwen3.5-122b-a10b, pixtral-12b, claude-opus-4-5-20251101, magistral-medium-2509, claude-opus-4-1-20250805, gpt-5.5-2026-04-23, gpt-5.4-2026-03-05, gemini-3.1-pro-preview, gpt-5, deepseek-chat, gemini-2.5-pro, gpt-5.2-2025-12-11, qwen3.5-27b, qwen/qwen3.5-27b, claude-sonnet-4-20250514, gemini-3.1-flash-lite-preview, qwen/qwen3-vl-8b-thinking, gpt-5-mini, claude-opus-4-7, gpt-4.1, magistral-small-2509, qwen3.5-flash-2026-02-23, gemini-2.5-flash, qwen/qwen3.5-35b-a3b, qwen/qwen3.5-397b-a17b, deepseek-reasoner, gpt-4.1-mini, qwen/qwen3-vl-30b-a3b-instruct, qwen3.5-plus-2026-02-15, qwen/qwen3.5-9b, claude-sonnet-4-6, deepseek-v4-flash, qwen/qwen3.6-plus, gpt-4.1-nano, gpt-4o, gemini-3-flash-preview, pixtral-large-2411, mistral-large-2512, ministral-8b-2512, gpt-5.3-codex, GLM-4.5V-FP8, google/gemma-4-31b-it, gemini-3-pro-preview, claude-3-opus-20240229, qwen/qwen3.5-plus-02-15, gpt-5.1-2025-11-13

Last 5 Runs

Score   Date          Provider    Model
98.48   3 weeks ago   openai      gpt-5.5-2026-04-23
97.42   3 weeks ago   deepseek    deepseek-v4-pro
97.01   3 weeks ago   deepseek    deepseek-v4-flash
18.80   4 weeks ago   openrouter  qwen/qwen3.5-9b
92.01   1 month ago   openrouter  qwen/qwen3.5-122b-a10b


Contributors

Role           Contributors
Domain expert  Ina Serif
Data curator   Sorin Marti

Tags
  • Type(s): newspaper-page
  • Benchmark task(s): data-correction
  • Writing: n/a
  • Source creation (century): 18
  • Source Layout: n/a
  • Language(s): en