RISE Humanities Data Benchmark

A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.

A test run includes:

Prompt and role definition – what the model was asked to do and from what perspective (e.g. “as a historian”).
Model configuration – provider, model version, temperature, and other generation parameters.
Results – the model’s actual response and its evaluation (scores such as F1 or accuracy).
Usage and cost data – token counts and calculated API costs.
Metadata – information like the test date, benchmark name, and person who executed it.

Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
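The components listed above can be pictured as one record per run. The following is only a minimal sketch: the class and field names (RunRecord, ModelConfig, Usage) and the benchmark name are hypothetical and are not taken from the benchmark's code. The token counts are the rounded values shown for test T1006.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelConfig:
    provider: str
    model: str
    temperature: float = 0.0


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int
    cost_usd: Optional[float] = None  # None when pricing is unknown

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


@dataclass
class RunRecord:  # hypothetical name for "one test run"
    test_id: str
    date: str                       # ISO date of execution
    benchmark: str
    prompt: str
    config: ModelConfig
    role: Optional[str] = None      # e.g. "as a historian"
    response: Optional[str] = None  # raw model output, if any
    scores: dict = field(default_factory=dict)
    usage: Optional[Usage] = None
    executed_by: Optional[str] = None


run = RunRecord(
    test_id="T1006",
    date="2026-04-21",
    benchmark="example_benchmark",  # placeholder, not the real benchmark name
    prompt="Fix this xml. Add xml-tags if faulty where it makes sense.",
    config=ModelConfig("openrouter", "google/gemma-4-26b-a4b-it-20260403"),
    scores={"fuzzy": 98.21},
    usage=Usage(input_tokens=33_000, output_tokens=33_700),
)
print(run.usage.total_tokens)
```

Keeping configuration, scores, and usage in separate fields makes it straightforward to group or filter runs by provider, model, or cost.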

Result 31 of 97

Test T1006 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	google/gemma-4-26b-a4b-it-20260403

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
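The prompt asks for a JSON object with three keys, and the configuration names a dataclass called CorrectedAdvert. A minimal sketch of how such a reply could be validated is shown here. It assumes the dataclass fields mirror the three prompt keys; the actual definition in the benchmark may differ, and parse_response is a hypothetical helper.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrectedAdvert:
    # Assumed fields: they simply mirror the keys named in the prompt.
    fixed_xml: str
    number_of_fixes: int
    explanation: str


def parse_response(raw: str) -> Optional[CorrectedAdvert]:
    """Return a CorrectedAdvert, or None if the reply is not usable JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    try:
        return CorrectedAdvert(
            fixed_xml=str(data["fixed_xml"]),
            number_of_fixes=int(data["number_of_fixes"]),
            explanation=str(data["explanation"]),
        )
    except (KeyError, TypeError, ValueError):
        # A required key is missing or number_of_fixes is not an integer.
        return None


reply = '{"fixed_xml": "<ad>x</ad>", "number_of_fixes": 1, "explanation": "closed tag"}'
print(parse_response(reply))
print(parse_response("not json"))
```

Returning None instead of raising lets a caller record a failed parse as a run without a valid result and continue with the next test.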

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
98.21	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a
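The page does not say how the Fuzzy Score is computed. As an illustration only, the sketch below uses Python's difflib to produce a string similarity on a 0 to 100 scale, and derives micro precision, recall, and F1 from pooled TP, FP, and FN counts using the standard definitions. The function names are hypothetical and the benchmark's own scoring may work differently.

```python
from difflib import SequenceMatcher


def fuzzy_score(predicted: str, expected: str) -> float:
    """Character-level similarity of two strings on a 0-100 scale."""
    matcher = SequenceMatcher(None, predicted, expected, autojunk=False)
    return 100.0 * matcher.ratio()


def micro_scores(tp: int, fp: int, fn: int) -> dict:
    """Micro precision, recall and F1 from pooled TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up strings and counts, for illustration only.
print(fuzzy_score("<ad>Sold by J. Smith</ad>", "<ad>Sold by J. Smyth</ad>"))
print(micro_scores(tp=8, fp=2, fn=2))
```

A similarity ratio of this kind rewards near-matches, which suits corrected XML where a single wrong character should not count as a complete miss.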

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 33.0K input tokens (IT) + 33.7K output tokens (OT) = 66.7K total tokens (TT)

Cost: 0.003$ + 0.012$ = 0.014$
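The displayed parts do not add up exactly (0.003 + 0.012 is 0.015, not 0.014). This is most likely a rounding effect: each figure is rounded to three decimals separately. The sketch below illustrates it with per-token pricing. The prices per million tokens are hypothetical values chosen only to make the effect visible, not the provider's actual prices, and the token counts are the rounded figures shown above.

```python
def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of a token count at a given price per one million tokens."""
    return tokens * price_per_million / 1_000_000


# Hypothetical prices in USD per million tokens (assumptions, not real pricing).
INPUT_PRICE = 0.08
OUTPUT_PRICE = 0.35

input_cost = cost_usd(33_000, INPUT_PRICE)
output_cost = cost_usd(33_700, OUTPUT_PRICE)
total_cost = input_cost + output_cost

# Each value is rounded independently for display, so the shown parts
# need not sum to the shown total.
print(f"Cost: {input_cost:.3f}$ + {output_cost:.3f}$ = {total_cost:.3f}$")
```

Summing the unrounded values and rounding only for display keeps the total accurate even when the printed parts appear to disagree by 0.001.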

Cite: Hindermann, Marti, Alberto, et al., (2025). RISE-UNIBAS/humanities_data_benchmark, 10.5281/zenodo.16941752

Result 32 of 97

Test T0954 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	qwen/qwen3.5-397b-a17b-20260216

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
93.84	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 32.2K IT + 410.9K OT = 443.0K TT

Cost: 0.013$ + 0.961$ = 0.974$

Result 33 of 97

Test T0928 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	qwen/qwen3.5-27b-20260224

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
95.90	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 1.3M OT = 1.4M TT

Cost: 0.008$ + 2.066$ = 2.074$

Result 34 of 97

Test T0902 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	qwen/qwen3.6-plus-04-02

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
93.52	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 38.5K IT + 572.4K OT = 610.8K TT

Cost: 0.013$ + 1.116$ = 1.129$

Result 35 of 97

Test T0941 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	qwen/qwen3.5-35b-a3b-20260224

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
59.95	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 37.2K IT + 1.7M OT = 1.8M TT

Cost: 0.006$ + 2.270$ = 2.277$

Result 36 of 97

Test T0837 at 2026-03-25

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	alibaba
Model	qwen3.5-35b-a3b

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
92.93	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 442.2K OT = 481.4K TT

Cost: 0.010$ + 0.884$ = 0.894$

Result 37 of 97

Test T0850 at 2026-03-25

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	alibaba
Model	qwen3.5-27b

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
95.84	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 413.5K OT = 452.7K TT

Cost: 0.012$ + 0.992$ = 1.004$

Result 38 of 97

Test T0752 at 2026-03-25

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	deepseek
Model	deepseek-chat

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
96.39	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.4K IT + 33.8K OT = 73.2K TT

Cost: 0.011$ + 0.014$ = 0.025$

Result 39 of 97

Test T0863 at 2026-03-25

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	alibaba
Model	qwen3.5-122b-a10b

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
96.00	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 442.9K OT = 482.1K TT

Cost: 0.016$ + 1.417$ = 1.433$

Result 40 of 97

Test T0889 at 2026-03-25

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	alibaba
Model	qwen3.5-flash-2026-02-23

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
83.62	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 459.8K OT = 499.1K TT

Cost: 0.004$ + 0.184$ = 0.188$
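To compare the ten runs listed above (results 31 to 40), their fuzzy scores and total costs can be put side by side. The sketch below copies those two figures per run from this page and ranks them. The tuple layout and the helper cheapest_at_least are hypothetical conveniences, not part of the benchmark.

```python
# (test id, model, fuzzy score, total cost in USD), copied from the results above
runs = [
    ("T1006", "google/gemma-4-26b-a4b-it-20260403", 98.21, 0.014),
    ("T0954", "qwen/qwen3.5-397b-a17b-20260216", 93.84, 0.974),
    ("T0928", "qwen/qwen3.5-27b-20260224", 95.90, 2.074),
    ("T0902", "qwen/qwen3.6-plus-04-02", 93.52, 1.129),
    ("T0941", "qwen/qwen3.5-35b-a3b-20260224", 59.95, 2.277),
    ("T0837", "qwen3.5-35b-a3b", 92.93, 0.894),
    ("T0850", "qwen3.5-27b", 95.84, 1.004),
    ("T0752", "deepseek-chat", 96.39, 0.025),
    ("T0863", "qwen3.5-122b-a10b", 96.00, 1.433),
    ("T0889", "qwen3.5-flash-2026-02-23", 83.62, 0.188),
]

# Highest fuzzy score first.
by_score = sorted(runs, key=lambda r: r[2], reverse=True)


def cheapest_at_least(min_score: float):
    """Cheapest run whose fuzzy score reaches min_score, or None."""
    eligible = [r for r in runs if r[2] >= min_score]
    return min(eligible, key=lambda r: r[3]) if eligible else None


for test_id, model, score, cost in by_score:
    print(f"{test_id}  {score:6.2f}  {cost:6.3f}$  {model}")
```

In this sample the spread in cost (0.014$ to 2.277$) is far larger than the spread in score among the better runs, so ranking by score alone hides most of the practical difference between configurations.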
